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Abstract 

We propose a novel high-dimensional linear regression estimator: the Discrete Dantzig 
Selector, which minimizes the number of nonzero regression coefficients subject to a budget 
on the maximal absolute correlation between the features and residuals. Motivated by the 
significant advances in integer optimization over the past 10-15 years, we present a Mixed 
Integer Linear Optimization (MILD) approach to obtain certifiably optimal global solutions to 
this nonconvex optimization problem. The current state of algorithmics in integer optimiza¬ 
tion makes our proposal substantially more computationally attractive than the least squares 
subset selection framework based on integer quadratic optimization, recently proposed in [8] 
and the continuous nonconvex quadratic optimization framework of |33j . We propose new 
discrete first-order methods, which when paired with state-of-the-art MILO solvers, lead to 
good solutions for the Discrete Dantzig Selector problem for a given computational budget. 

We illustrate that our integrated approach provides globally optimal solutions in significantly 
shorter computation times, when compared to off-the-shelf MILD solvers. We demonstrate 
both theoretically and empirically that in a wide range of regimes the statistical properties 
of the Discrete Dantzig Selector are superior to those of popular £i-based approaches. We 
illustrate that our approach can handle problem instances with p = 10,000 features with cer¬ 
tifiable optimality making it a highly scalable combinatorial variable selection approach in 
sparse linear modeling. 


1 Introduction 

We consider the familiar linear regression framework, with response vector y G model 

matrix X G regression coefficients (3 G and errors e G y = X/3-|-e. We assume, 
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unless otherwise mentioned, that the columns of X, denoted by Xj for j = 1,... ,p, have been 
standardized to have zero means and unit t' 2 -norm. In many modern statistical applications, the 
number of variables, p, is larger than the number of observations, n. In such cases, to carry out 
statistically meaningful estimation, it is often assumed that the number of nonzero elements in 
f3 is quite small 1271. The task is to obtain a good estimate, (3, which is sparse and serves as a 
good approximation to the underlying true regression coefficient. Of course, the basic problem of 
obtaining a sparse model with good data-fidelity is also of interest when the number of observa¬ 
tions is comparable to or larger than p. In the sparse high-dimensional setting described above, 
two estimation approaches that have been very popular among statisticians and researchers in 
related fields are the Lasso m and the Dantzig Selector m- Both estimators can be expressed as 
solutions to convex optimization problems, which can be solved using computationally attractive 
procedures HB [31 EOl [23], and come with strong theoretical guarantees |16L For reasons 

that are explained later in this section, the primary motivation for our investigation in this pa¬ 
per is the Dantzig Selector, which is defined as the solution to the following linear optimization 
problem: 

nun ||/3||i s.t. ||X’^(y - X/3)||oo < <5. (1) 

To distinguish this estimator from our proposed approach, we refer to it as the £i-Dantzig Selector. 
This estimator seeks to minimize the £i-complexity of the coefficient vector, subject to a constraint 
on the maximal absolute correlation between the corresponding residual vector and the predictors. 
The tuning parameter 6 controls the amount of data-fidelity: a small value of S corresponds to a 
good fit, and a larger value of 6 leads to heavy shrinkage of the estimated regression coefficients. 
|I6j point out several reasons as to why the feasibility set in Q might serve as a good measure 
for data-fidelity. In particular, this set is invariant with respect to orthogonal transformations 
on the data (y,X). It can also be shown that 5 control^ the residual sum of squares: the 
latter can be made arbitrarily close to the minimal least-squares value by decreasing 5. The 
.^i-Dantzig Selector, like the Lasso, is used extensively as a model fitting routine to obtain a path 
of sparse linear models, as the data-fidelity parameter is allowed to vary isni, and allows a natural 
extension to more general response distributions [29]. Note that Problem 0 can be rewritten as 
a linear optimization problem and can be solved quite easily for problems with p in the order of 
thousands. Under some mild conditions, and even for p much larger than n, the corresponding 
estimator achieves a loss within a logarithmic factor of the ideal mean squared error achieved if 
the locations of the nonzero coordinates were known mm- 

The £i-Dantzig Selector, however, has limitations. In the presence of highly correlated covariates, 
^More formally, we have: 

||X^(y - X/3)|U > f||y _ x/ 3||2 - jjy - X^LsIla) " , 

2np 2 ' ' 

where, Apmin(X) is the minimum nonzero singular value of X, and /3 ls is any least-squares solution — see Propo¬ 
sition A.l in m for a proof of this result. 
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the estimator tends to choose a dense model, typically bringing in an important variable together 
with its correlated cousins, which does not significantly hurt the £i-norm of the corresponding 
coefficient vector. If one increases the data-fidelity threshold 5, the selected model becomes 
sparser, however, in the process, important variables might get left out. This is largely due to the 
nature of the bias imparted by the £i-norm, which penalizes both large and small coefficients in 
a similar fashion. Similar issues also arise in the case of Lasso [aa ESI sans]. If the £o-pseudo- 
norm is used instead of the ^i-norm, the aforementioned problems can be ameliorated: given 
multiple representations of the model with similar data-fidelity, the ^o-pseudo-norm will always 
prefer the most parsimonious representation. In addition, the £o-pseudo-norm does not shrink 
the regression coefficients: once an important variable enters the model, it comes in unshrunk 
with its full effect, which, in turn, drains the effect of its correlated cousins and naturally leads 
to a sparser model. 

Our Proposal. The preceding discussion suggests a natural question: what if we replace ||/3||i 
in Problem Q with ||/3||o := 7^ 0) ~ f^e number of nonzero entries in /3? This leads to 

the following discrete optimization problem, which also happens to define the estimator that we 
propose: 

min ||/3||o s.t. ||X'^(y - X/3)||oo < (J. (2) 

We refer to the above estimator as the Discrete Dantzig Selector. A couple of questions that may 
be asked at this point are: 

• Is the estimator defined via Problem Q computationally tractable? 

• Does the Discrete Dantzig Selector lead to solutions with superior statistical properties, 
when compared to its counterpart? 

Addressing these questions and answering them affirmatively is the main focus of this paper. 

The objective function in Problem Q, represented by ||/3||i, may be thought of as a convexification 
of the discrete quantity ||/3||o, which counts the number of nonzeros in the regression coefficient 
vector (3. The corresponding estimator seeks solutions with small ^i-complexity. While this 
often leads to sparse solutions, i.e. those with few nonzero coefficients, the sparsity is an indirect 
consequence of minimizing ||/3||i. The Discrete Dantzig Selector on the other hand, targets 
sparsity directly, in its very formulation. Problem Q can be reformulated as a Mixed Integer 
Linear Optimization (MILD) problem — due to the major advances in algorithmic research in 
MILO over the past 10-15 years, these methods are widely considered as a mature technology in 
a subfield of mathematical programming miEH]. Algorithmic advances coupled with hardware 
and software improvements have made MILO problems solvable to certifiable optimality for various 
problem sizes of practical interest. In this sense, it is perhaps appropriate to perceive MILO as 
a computationally tractable tool. The view of computational tractability we adopt here is not 
polynomial time tractability, but the ability of a method to provide high quality solutions with 
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provable optimality certificates for problem types that are encountered in practice, in times that 
are appropriate for the applications being addressed. Our approach is aligned with an intriguing 
recent line of work in computational statistics: the use of Mixed Integer Optimization and, more 
broadly, modern optimization techniques to solve certain classes of discrete problems arising in 
statistical estimation tasks — see, for example, the recent works of mi- Further background 
on MILO appears in Section [2.1[ 

In this paper, we bring together recent advances from diverse areas of modern mathematical 
optimization methods: first-order techniques in convex optimization and MILD techniques. We 
provide a novel unified algorithmic approach that 

(a) performs favorably over standalone of-the-shelf MILD solvers applicable for Problem Q, in 
terms of obtaining good quality solutions with provable certificates of optimality, and 

(b) scales gracefully to problem sizes up to p = 10,000 or even larger. 

In an extensive series of experiments with synthetic and real data we demonstrate that our unified 
approach solves, to global optimality, instances of Problem Q with n ~ 500, p ~ 100 in seconds, 
and underdetermined problems with n ~ 900, p ~ 3,000 in minutes. While it takes marginally 
longer to provide certificates or guarantees of global optimality, the corresponding times are quite 
reasonable: in all the aforementioned instances the certificates of optimality are available within 
an hour. Our approach scales to several instances of problems with n < p and p in the range 5,000 
to 10,000, delivering optimal solutions in approximately an hour and proving optimality within 
at most two days, in all instances. We also find that the statistical properties of our estimates 
are substantially better than those of computationally friendlier alternatives, like the £i-Dantzig 
Selector, in terms of both the estimation error and the variable selection properties. Detailed 
results appear in Section 

Examples. To provide the reader with some intuition, we present a set of three examples, which 
illustrate the differences between the solutions to Problems Q and Q. The following simple 
exampl^ demonstrates how the £i-based method Dantzig Selector might experience difficulty in 
producing a sparse solution in cases where the signal predictors are highly correlated. 

Example 1. Let p = n + 1. Take the first feature as xi = (1,t, .take the (z,j)th entry of 

the feature matrix, X, as xtj = 1 if i = j — 1 and zero otherwise, for i = 1, ...,n and j > 2, and 
set y = Xi - X 2 . 

The £i-norm of the sparse representation of the response, y = xi — X 2 , equals 2. Note that, 
given the available predictors, the response admits only one other exact representation, y = 
rx 3 + ... -|- rxp. The cost for this dense representation is T{n — 1), which is lower than the 
corresponding value for the sparse representation when r is small. Consequently, as long as 
r(n — 1) < 2, both the Lasso and the fi-Dantzig selector select the dense representation of the 

^this example was suggested to us by Emmanuel Candes 
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response. Alternatively, .^o-based methods recover the sparse representation. More specifically, 
consider the solution to Problem Q: if the tuning parameter 6 is set below r/(l + r), then the 
estimator exactly recovers the sparse representation of the response. 


Discrete Dantzig Selector Coefficient Profiles 


Example 1 


Example 1' 


Diabetes data (n=442,p=10) 
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^i-Dantzig Selector Coefficient Profiles 





Figure 1: Coefficient profiles for the Discrete Dantzig Selector and the ^i-Dantzig Selector, as a function 
of the data-hdelity parameter, 6. The dashed vertical lines in the top row of plots indicate the locations 
where the number of active variables changes. [Left Panel] corresponds to Example 1; [Middle Panel] 
corresponds to Example 1'; the “true” nonzero coefficients of +1 and —1 are shown as horizontal starred 
lines. [Right panel] corresponds to the path for the Diabetes dataset. The numbers overlain on the prohles 
indicate the different features. 

Figure demonstrates the difference between the Discrete Dantzig Seleetor and the £i-Dantzig 
selector, by displaying the coefficient profiles for both methods. Note that the profiles for the 
Discrete Dantzig Selector are constructed in a piece-wise constant fashion, where for each given 
model size, the displayed coefficients are taken from the solution corresponding to the lowest 
attainable value of ||X''^(y — X/3)jjoo. The left panel of Figure [^corresponds to Example 1, with 
n = 10, p = 11 and r = l/2p, which we slightly modified by adding noise to the response: y = 
xi — X 2 +e. Ci’s are independently generated from a centered Gaussian distribution, corresponding 
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to the Signal to Noise Ratic0 (SNR) of 1.3 x 10®. 

Now consider Example 1', which is similar in spirit to Example 1. Here, the first two features, 
xi and X 2 , are drawn from a centered bivariate Gaussian distribution with correlation 0.7. The 
remaining p — 2 features are drawn from an independent standard Gaussian ensemble. All the 
features are standardized to have unit .^ 2 -norm, and the response is generated with SNR=1.4x 10®. 
The middle panel in Figure displays the corresponding coefficient profiles with n = 10, p = 12. 
The Discrete Dantzig Selector exactly recovers the true model for a wide range of the tuning 
parameter, 5. As <5 is decreased, and noise variables come into the model, their coefficients 
remain highly shrunk, while the coefficients for the signal variables remain near their true values. 
On the other hand, the £i-Dantzig selector is unable to recover the true model, and produces a 
large estimation error for all values of the tuning parameter. 

The right panel in Figure corresponds to the well known Diabetes dataset [l9] , where n = 442 
and p = 10; here all the variables including the response were standardized to have unit ^ 2 -iiorm 
and zero mean. Note that, despite some similarities, the sequences of predictors entering the 
model are different for the two approaches. 

We note that instead of formulation ([^, one may prefer to minimize the data-fidelity term subject 
to a constraint on the number of nonzeros in (3: 

imn ||X’^(y - X/3)||oo s.t. \\P\\o < k. (3) 

The framework developed in this paper may be adapted to Problem Q. In this paper, however, 
we focus on Problem Q. 

Context and Related Work. A primary motivation of our work is derived from the recent 
work on the least squares variable selection problem [8], where the authors study 

min ^||y-X/3||i s.t. \\l3\\o<k, (4) 

p z 

using mixed integer convex quadratic optimization (MIQO) methods. While certain statistical 
properties of solutions from Problems Q and (|^ are comparable; we observed in our compu¬ 
tational experiments (see Section that Problem Q is orders of magnitude faster (often by 
a factor of hundreds) than Problem (|^ in obtaining solutions with certificates of global opti¬ 
mality. In addition, a MILO formulation for Problem Q consumes much less memory than a 
comparable MIQO formulation for Problem Q. The aforementioned computational superiority 
of MILO over MIQO should not come as a surprise. Indeed, it is quite well known in the integer 
programming community (see, for example, the nice review papers 1281 [H]) that current algo¬ 
rithms for MILO problems are a much more mature technology than MIQO. Recently, [33] proposed 
MIPGO for nonconvex penalized least squares regression based on purely continuous nonconvex 

®For a model generated as -|- Ci, i = 1,..., n; we define SNR as follows: SNR= Var(^)/Var(e). 
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quadratic optimization. We demonstrate the substantial superiority of our proposal over MIPGO 
(in Section 7.1) in obtaining high quality statistical solutions within a given computational bud¬ 
get. 


Thus, the superior computational scalability of the corresponding optimization methods forms a 
principal motivation to study Problem as an effective estimation procedure for sparse linear 
regression. In addition, from a statistical viewpoint, in terms of estimating sparse regression 
models subject to good data-fidelity, the Discrete Dantzig Selector may be perceived as a natural, 
interpretable and useful alternative to least squares with variable selection. Problem Q — in the 
same way as the t'l-Dantzig Selector may be viewed as an appealing alternative to Lasso. 

Contributions. Our main contributions can be summarized as follows: 


1. We propose a new high-dimensional linear regression estimator: the Discrete Dantzig Selec¬ 
tor, which minimizes the number of nonzero regression coefficients, subject to a budget on 
the maximal absolute correlation between the features and the residuals. We show that the 
estimator can be expressed as the solution to a MILO problem, a computationally tractable 
framework that delivers certifiably optimal global solutions; and is computationally more 
scalable than the recently proposed methods in [8l El 03] . 

2. We develop new discrete first-order methods, motivated by recent algorithmic developments 
in first-order continuous convex optimization, to obtain high quality feasible solutions for 
the Discrete Dantzig Selector problem. These solutions are passed onto MILO solvers as 
warm-starts. Our proposal leads to advantages over the off-the-shelf state-of-the-art inte¬ 
ger programming algorithms in terms of (a) obtaining superior upper bounds for a given 
computational budget and (b) aiding MILO solvers in obtaining tighter lower bounds and 
hence improved certificates of optimality. Exploiting problem specific information, we also 
propose enhanced MILO formulations, which further improve the algorithmic performance 
of MILO solvers. 

3. We characterize the statistical properties of the Discrete Dantzig Selector and demonstrate 
both theoretically and empirically its advantages over £i-based approaches. Our results also 
apply to approximate solutions for the Discrete Dantzig Selector optimization problem. 

4. Our approach obtains optimal solutions for p ~ 500 in a few minutes, p ~ 3,000 within 
fifteen minutes and for problems with p = 10,000 in an hour. Certificates of optimality 
are obtained at the expense of higher computation times—for instances with p ~ 500 they 
are obtained within half-hour, for p ~ 3,000 they are achieved around an hour and for 
p = 10,000 the certificates arrive in the range from three to forty hours. To the best of our 
knowledge, we present herein, the largest problem instances in subset selection for which 
certifiably optimal solutions can be obtained. 

Roadmap. The remainder of the paper is organized as follows. Section describes the op- 
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timization methodology behind the proposed approach, and discusses its connections with the 
^i-Dantzig Selector optimization problem. In Section the statistical properties of the Dis¬ 
crete Dantzig Selector are analyzed from a theoretical point of view; the results are compared 
to the £i-Dantzig Selector and the Lasso. The framework of discrete first-order methods is de¬ 
scribed in Section]^ Additional discussion of MILO formulations together with problem specific 
enhancements, is presented in Sectionj^ Section [^gathers numerical results on the computational 
performance of our algorithms in a variety of settings. An empirical analysis of the statistical 
properties of the Discrete Dantzig Selector is conducted in Section Some technical details are 
provided in the Appendix. 


2 Overview of the Proposed Methodology 

Herein, we introduce and summarize the general aspects of the proposed methodology. Further 
details and enhancements are provided in Sections and 

2.1 Mixed Integer Linear Optimization (MILO) Preliminaries 

The general form of a MILO problem is as follows: 

min a'^^Q: 

OL 

s.t. Act < b 

ai G {0,1}, i = 1,... ,mi 
Oij >0, j = mi 1 ,... ,m, 

where a G G and b G are the problem data, the symbol “<” denotes 

element-wise inequalities, and we optimize over (cti,..., am) := ct G M™' containing both discrete 
(ai, i = 1,..., mi) and continuous («*, i = mi + 1,... ,m) variables. For background on MILO, 
we refer the reader to mm- Some modern integer optimization solvers include Cplex, Glpk, 
Gurobi, Knitro, Mosek, Scip — see also [32] . 

As already alluded to in Section]^ there has been significant progress in the theory and practice 
of MILO over the past fifteen to twenty years. Specifically, the computational power of MILO 
solvers has undergone impressive advances over the past twenty-five years — the cumulative 
machine-independent speedup factor in MILO solvers between 1991 and 2015 is estimated to be 
780,000 [To]. This progress can be attributed to the inclusion of both theoretical and practical 
advances into MILO solvers. Some of the main factors responsible for this speedup are advances 
in cutting plane theory, improved heuristic methods, disjunctive programming for branching 
rules, techniques for preprocessing MILOs, using linear optimization as a black box to be called 


Diabetes Dataset (n = 442,p = 64, ||/3||o = 41) 




Figure 2: Typical evolution of a MILD algorithm for Problem (i). as a function of time. [Left Panel] 
displays the progress of Upper Bounds (UB) and Lower Bounds (LB) for the optimal value of the objective 
function. The upper bounds, which correspond to feasible solutions for Problem ([^ are seen to stabilize at 
the global minimum within a few seconds. The lower bounds provide a certificate of how far the current 
solution might be from the global solution — these bounds progressively improve as the MILO algorithm 
explores more nodes in the branch and bound tree. Observe that the certihcate of global optimality arrives 
at a later stage, even though the algorithm finds the global solution very quickly. [Right Panel] displays 
the evolution of the corresponding MILO Optimality Gap (in %), defined as (UB - LB)/UB, with time. 


by MILO solvers, and improved linear optimization methods m- In addition, there have been 
substantial improvements in hardware speed: the overall hardware speedup from 1993 to 2015 is 
approximately estimated to be 10^'^^ ~ 570,000 [1] . When both hardware and software advances 
are combined, the overall speedup for MILO problems is estimated to be around 450 billion! One 
attractive feature of MILO solvers, which is a stark contrast to heuristic approaches, is that the 
former provide (a) feasible solutions, which are also upper bounds to the minimum objective value 
and (b) lower bounds for the optimal value of the objective function. As a MILO solver makes 
its way to the global optimum, the lower bounds become tighter, thereby providing improved 
certificates of sub-optimality (see Figure for an illustration). This aspect of MILO solvers is 
quite useful, especially if one decides to stop the solver before reaching the global optimum. In 
the modern day world, MILO plays a key role in various impactful application areas of operations 
research: revenue management, air-traffic control, scheduling and matching tasks, production 
planning and others da n\. In this paper, we show how the power of MILO can be used in 
the context of a problem of fundamental importance in statistics, namely, sparse linear model 
estimation — we build upon recent line of work in computational statistics, at the interface of 
modern discrete optimization and fundamental techniques in statistical modeling mE]. 
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2.2 MILO formulations for the Discrete Dantzig Selector 


Assuming without loss of generality that Problem (§ has a minimizer which is bounded, it can 
be obtained by solving 

Pi := nun ||/3||o 

s.t. ||XT(y-X/3)||oo <5 (5) 

ll/3||oo<Mc/, 

where, Adu is a large but finite number [7]. We present a MILO formulation for Problem ([^ (and 
also, Problem Q) 

p 

min ^ Zi 

s.t. -6 < dj - < S, j = l,...,p 

-MuZj < Pj < MuZj, j = l,...,p 

where the optimization variables are z (binary) and (3 (continuous); the problem data consists of 
d := X'^^y = (di,..., dp)^ and Qpxp := X^X = [qi,..., q^]. Formulation Q is often referred 
to as a “Big-M” formulation due to the presence of the parameter Aiu- The binary variable Zi 
controls whether fdi is zero or not; if Zi = 0 then /3i = 0 and if Zj = 1 then fdi is free to vary in 
the interval [—Afjj, Af c/]- The objective function controls the number of nonzeros in the 

model. Figurej^shows the performance of the above MILO formulation on the Diabetes dataset m 
with n = 442,p = 64 (here, we mean-centered and scaled y,Xj’s to have unit f' 2 -norm). 

Formulation Q has intriguing connections to the ^i-Dantzig Selector: the binary variables Zi G 
{0,1} in Problem Q can be relaxed into continuous variables Zi G [0,1], leading to: 

^2 := min 7^ E IPjl 

P i=l 

S.t. ||XT(y-X/3)||oo <<5 (7) 

-Mu < fdj < Mu, j = 1,... ,p. 

Problem Q modifies the £i-Dantzig Selector problem: 

F 3 :=min -^||/3||i s.t. ||X’^(y - X/3)||oo < <5, 

Mu 

by including an .^ 00 -constraint on (3. It follows from the above that: Ps < F 2 < Fi. The last 
inequality is typically strict, and, depending upon the data, the gap between the values, as well as 
between the corresponding optimal solutions, can be substantial, as illustrated in Figure The 
above discussion provides another viewpoint for explaining the differences between the £i-Dantzig 
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Selector and the Discrete Dantzig Selector estimators. If Mjj is taken to be large enough then 
Ts = Ta. 

We note that the constraint ||/3||o < k can also be expressed using Specially Ordered Sets [7] as 
an alternative to the “Big-M” formulation ([^. This can be used when the user does not wish to 
specify a-priori any bound Aijj on the regression coefficients and/or the coefficients have widely 
varying amplitudes. We discuss this further in Section 

We emphasize that Aiu appearing in formulation Q should not be interpreted as a statistical 
tuning parameter — it appears from a purely algorithmic viewpoint and, as we saw, has interesting 
connections to the ^i-Dantzig Selector problem. Mu might be taken to be arbitrarily large (but 
finite) to obtain a solution to Problem ([^. In Section]^ we describe several data driven methods 
to estimate Aiu, and we also discuss other structured formulations of Problem Q, which lead to 
improved algorithmic performance: they deliver tighter computational lower bounds in smaller 
amounts of time. We now proceed towards an analysis of the statistical properties of the Discrete 
Dantzig Selector and investigate its comparative advantages over its ^i-counterpart. 


3 Statistical Properties: Theory 

In this section we study the statistical properties of the Discrete Dantzig estimator. In partic¬ 
ular, we characterize its connections with the Best Subset selection estimator in the case of the 
orthonormal design, we investigate its oracle properties in the classical asymptotic regime and, 
finally, we analyze its estimation, prediction and variable selection performance in the high di¬ 
mensional setting. To improve readability, all the technical proofs are presented in Section [A| in 
the Appendix. 

3.1 Orthonormal Design 

Here we assume xjxfc equals 1 \i j = k and 0 otherwise. Note that such an assumption requires 
n > p. Our goal is to compare and connect the Discrete Dantzig estimator, f3, which solves 
the optimization problem ([^, with the Best Subset selection estimator, f3 , which solves Q. 
In particular, we want to understand the relationship between the tuning parameter 5 in the 
Discrete Dantzig optimization problem and the tuning parameter k, which controls the io norm 
of the Best Subset solution. Note that Problem Q does not generally have a unique optimizer, 
so we use (3 to refer to just one of the Discrete Dantzig solutions. 

Define Cj = xjy for j = l,...,p. To simplify the presentation, and without loss of generality, 
suppose that the predictors are indexed in such a way that |ci| > |c 2 | > ... > \cp\. Suppose also 
that Cfc / Cfc+i, to ensure uniqueness of the Best Subset solution. 
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Theorem 1. 1. The Best Subset selection estimator is uniquely defined as 

^BS . , 

= (ci, . . . ,CA:,0, . . . ,0). 

2. Suppose that 6 G [ck+i,Ck)- Then, the set of all the Discrete Dantzig solutions is 
{(ci + ui, ...,Ck + Mfe,0, ...,0), \uj\ < 5,j = 1, ...,k} . 

It follows that each Best Subset estimator is a Discrete Dantzig solution for an appropriately 
chosen 5. The coefficients of both estimators are obtained by the hard thresholding of the co- 
variances Cj, however, the nonzero coefficients of the Discrete Dantzig estimator are allowed to 
deviate from the value of the covariance by an amount bounded above by 6. 

3.2 Fixed p asymptotics 

To avoid confusion, we refer to the true coefficient vector as (3*. In this subsection we treat the 
number of predictors, p, as hxed and let the number of observations, n, tend to inhnity. The 
standard assumption for deriving asymptotic results in this setting is that (3* does not depend 
on n, and converges to a non-singular covariance matrix C. We rewrite the above 

assumption to be consistent with the scaling ||xj ||2 = 1 for j = 1, ...,p, which is used throughout 
this paper. Thus, we require that X^X converges to C as n tends to inhnity, and /3* = , 

for some hxed vector (3 . We also impose a standard assumption that e* are i.i.d. with zero mean 
and hnite variance. Given an index set J C {1,... ,p} we write Xj for the sub-matrix of X that 
consists of the columns identihed by J. Let J* denote the support of the vector (3*. We dehne 
the Oracle estimator, (3 , as the least-squares estimator computed using only the true predictors. 
In other words, the support of (3 equals J*, and (3j* = (Xj*Xj*)“^Xj*y. 

Theorem 2. Let <5 —>• oo and S = o(n^/^). Suppose that matrix C is invertible. Then, with 
probability tending to one, 

1. The support of each solution to the Discrete Dantzig optimization problem equals J*; 

2. Both the true coefficient vector, (3*, and the Oracle estimator, (3 , belong to the set of 
solutions to the Discrete Dantzig optimization problem. 

Consider the polished version of the Discrete Danzig estimator, which is dehned as follows; given 
a Discrete Dantzig solution with support J, the support of the polished estimator, (3 , is set 

^ ^P 

equal to J; on its support (3 is dehned as the least-squares estimator using the corresponding 
predictors, i.e. jSj = (XjXj)“^Xjy. Note that the value of [3 generally depends on the 
the choice of the Discrete Dantzig solution. However, this choice becomes irrelevant under the 
setting of TheoremMore specihcally, with probability tending to one, every polished estimator 
coincides with the Oracle estimator. 
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Corollary 1. 

tending to one. 




Under the assumptions of Theorem equality (3 


(3 holds with probability 


Consequently, estimator (3 satisfies the oracle property in the sense of Fan and Li m- 


3.3 High Dimensional Setting 


Here we focus on the case where p is large, possibly much larger than n. We discuss the prop¬ 
erties of the global as well as approximate solutions to Problem Q. We also comment on the 
estimator obtained from the closely related Problem Q, in which ||/9||o is constrained, rather 
than minimized. We assume that the error terms in the underlying linear model are mean zero 
Gaussian with variance cr^ and, as before, use /3* to refer to the true regression coefficient vector. 
We start with some notation. For every vector 0 € M.P and index set J C {1, ...,p} we write 6j 
for the sub-vector of 6 determined by J. 

Definition 1. Given positive integers k and m, such that m € [k,p — k], and a positive cq, let 


7(/c) = mm ,, — 

0^o,||0||o<fc ll^lb 


K{k, Co) = min 


|X0| 


mm 


K{k, Co, m) = min 


|Jo|<fc 1^07^0, II 0jc 111 <co II 0Jg 111 ll^Jolb^ 

||X0||2 


mm 


|Jo|<fc [^ 07 ^ 0 , ||0jc||i<co||0j(j||i ||0Joi||2 


where Jq C {1, ...,p} and Jqi := Jq U Ji, with Ji identifying the m largest (in magnitude) coordi¬ 
nates of 6 outside of Jo- 

We use s* to denote ||/3*||o- As we discuss in the next subsection, quantities [fi;(s*,co)]~^ and 
[k(s*, Co, for m > s* and co > 1, appear in the error bounds for the Lasso and the original 

Dantzig selector, while [ 7 ( 25 *)]”^ appears in the bounds for Discrete Dantzig Selector. The 
following result establishes some useful relationships for these quantities. 

Proposition 1. For all positive integers k and m, with m G [k,p — k], and all co > 1 the following 
holds: 7(2/0) > k(/c,co)/\/2 and 7(2/0) > K{k,co,m). 


Recall the setting of Example 1. When T{n — 1) < 2, the (.1 methods, such as the original Dantzig 
selector, fail to recover the sparse representation of the response. Note also that k(2,co) = 
«:(2,co,m) = 0, for m > 2 and cq > 1. On the other hand, 7 ( 4 ) > 0 for p > 4, and the Discrete 
Dantzig Selector succeeds in recovering the correct sparse representation, for every sufficiently 
small value of the tuning parameter, 5. 


The following theorem establishes several useful bounds for the Discrete Dantzig Selector. 
Theorem 3. Suppose that (3 solves optimization problem ([^ for 6 = a\j2{\. a) logp, where a > 
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0. The following bounds hold with probability bounded below by 1 — {p^'yJ'K logj?) ^ 


ll3llo 

< 

s* 

ii3-nii 

< 

A[j{2s*)]-^s*6 

ii3-nii 

< 

8[7(2s*)]"S*h2 

n-i||x(3-r)iii 

< 

8[7(2s*)]-2s*52^ 


Remark. It follows from the proof of Theorem that the above result 

(i) holds uniformly over the set {(3* : ||/3*||o < s*}; 

(ii) also holds for the solution to Problem ([^ with k = s*. 

We now compare the above bounds to those established for the popular ^i-based approaches. 
Under the assumed scaling of the predictors, and for every positive integer m, such that m G 
[s*,p — s*], Theorem 7.1 in gives the following error bounds for the £i-Dantzig Selector esti¬ 
mator, 


Pos-riii < 

ll/^DS “/3*ll2 ^ 

n-^\\MPDS-m < 


8[Kis*,l)]-^ s*d 

16 -b y/ s/rnj [k(s*, 1 , m)]~^ 

16 [K(s*,l)]" 2 s*h 2 , 


( 8 ) 


By Proposition [T| the right hand sides of the above inequalities are at least as large as the 
corresponding bounds in Theorem Moreover, the differences in the two sets of bounds can 
potentially be quite signihcant. Consider the setting of Example 1 for illustration. The upper 
bounds in Theorem]^ are hnite for p > 4, while the three bounds in display Q are inhnite. 

Examining Theorem 7.2 in [9], we conclude that the corresponding error bounds for the Lasso 
are at least as large as those given in display ([^. The same result also provides an upper bound 
on the £o-pseudo-norm of the Lasso estimator, /3Lasso- 


11 /^Lasso 110 ^ 


64(^niax * 


where 4’ma.x is the maximum eigenvalue of the matrix X^X. Note that the right-hand side of 
the above bound is inhnite in the setting of Example 1. In general, this upper bound is at 
least 64 times as large as the one for the Diserete Dantzig Selector estimator. We informally 
summarize the above hndings as follows: when compared to the ^i-based approaches, the Dis¬ 
crete Dantzig Selector satishes as good or better estimation and prediction error bounds, while 
achieving signihcantly higher level of sparsity. 

We can sharpen the bounds in Theorem]^ by making them dependent on the support of /3*. More 
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specifically, given an index set J* we define 


1{J*) 


min 

J|=2s*, JD J* 


min 

07^O,0jc=O 



Then, Theoremholds with 7(2s*) replaced by ^{{k : ^ 0}), and the corresponding result is 

uniform over (3*. 


The following corollary to Theorem shows that the Discrete Dantzig Selector successfully recov¬ 
ers the support of the true coefficient vector, provided the nonzero coefficients are appropriately 
bounded away from zero. Define |/3*|min = 7 ^ 0}. 

Corollary 2. //|/3*|min > du [ 7 ( 25 *)]“^ ^(1 + o)s* logp, then the estimator from Theorem^ 
exaetly recovers the support of (3*, with probability bounded below by 1 — (p“\/ 7 rlogp)“^. 

We now consider an estimator (3 that is a feasible solution to the optimization problem Q, but 
not necessarily the optimal solution. Recall that our algorithms produce (3 together with a lower 
bound on the minimum value of the objective function, ||/3||o. We denote this lower bound by 
'slb- The next result shows that if the algorithm is stopped when ||/3||o is within a prespecified 
multiplicative factor of 'slBi the bounds from Theorem continue to hold after an appropriate 
adjustment. The corresponding proof follows the argument in the proof of Theorem making 
only minor modihcations. 

Theorem 4. Suppose that (3 is a feasible solution to the optimization problem Q, corresponding 
to 6 = cry/2(l -|- a) logp, where a > 0, such that ||/3||o < (1 + Then, the following bounds 

hold with probability bounded below by 1 — logp)“^; 


n 


-1 


ll3llo 

ii3-/31Ii 

Ii3-/3iii 

lix(3-/9*)iii 


< (l-t-V')'S* 

< 2{2 + i ;)[^{[2 + f;]s*)]-‘^ s *5 

< A {2 + ' il ;)[^{[2 + i ’] s *)\~^ s * 5 ‘^ 

< 4(2 + ^A)[7([2 + V’]s*)]"^s*5^ 


Note that the constant V' is typically quite small in practice, for example, 0.1 (see the right panel 
in Figure for an illustration of the evolution of over time for the Diabetes dataset.) Thus, 
the corresponding effect on the error bounds is generally minor. 


4 Obtaining Good Solutions via Discrete First-Order Methods 

In this section we propose new algorithms, referred to as discrete first-order methods, which deliver 
good upper bounds for Problem Q. It is important to note that unlike the MILO framework, 
these algorithms do not provide lower bounds. Instead, the solutions obtained by our methods are 
passed to MILO solvers as warm-starts. The proposed algorithms are inspired by recent advances 
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of first-order methods in convex optimization I37l|36l[38], and can be viewed as their nonconvex 
adaptations. We summarize their key advantages: 

• They provide excellent upper bounds to Problem ([^ with low computational cost, time and 
memory requirements. 

• MILD solvers accept these solutions as warm-starts and consequently improve upon them. 
This hybrid approach outperforms the stand-alone capabilities of an off-the-shelf MILO solver, 
producing high quality upper bounds in amounts of time that are orders of magnitude 
smaller. 

• The solutions obtained can be used to improve the overall run-time of MILO solvers, including 
certificates of global optimality. 

We validate our proposed methods on several synthetic and real-data datasets. 


4.1 Discrete First-Order Methods 


Problem Q involves the minimization of a discontinuous objective function over a polyhedral 
set. Thus, it is not directly amenable to simple proximal gradient type algorithms [Ml EH [8]. 


We propose two algorithms: Algorithm 1 (see Section 4.1.1) and Algorithm 2 (see Section 4.1.2), 


both of which can be used as stand-alone solvers for obtaining good quality upper bounds to 
Problem (§. We also present a hybrid method. Algorithm 3, which combines the strengths of 
both Algorithms 1 and 2, by using the solution obtained from Algorithm 1 as an initialization to 
Algorithm 2. In our experiments Algorithm 3 showed the best empirical performance. Section 
presents numerical results illustrating the performance of our framework. We emphasize that 
Algorithms 1—3 only provide good upper bounds, they do not certify the quality of solutions via 
lower bounds. A main purpose of these algorithms is to provide good quality upper bounds to 
initialize the MILO solvers — the latter, in turn, are often found to improve upon the upper bounds 
obtained from these first-order algorithms, at the cost of more (but still reasonable) computation 
times. 


4.1.1 The Variable Splitting Method 

We present our first discrete first-order method based on a classical method in nonlinear opti¬ 
mization: the Alternating Direction Method of Multipliers (aka ADMM) [3] popularly used in 
the context of convex optimization-we refer the reader to [l2j for a nice exposition on this topic. 
We choose this method because of its simplicity and good performance in practice, as seen in our 
experiments. To apply this algorithm, we decouple the feasible set {/3 : ||X^(y — X/3)||oo < 5} and 
the discontinuous function (3 i—)• ||/3||o. Observe that Problem ([^ can be equivalently rewritten 
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as: 


min ||/3||o 

ot,p 

s.t. ||XT(y-Xa)||oo <5 (9) 

a = p. 

We consider the Augmented Lagrangian given by: 

Cx{P,a-,u) := ||/3||o + ^||/3-Q:||i + (i^,a-/3) (10) 

for some value of A > 0, where, u may be thought of as a “dual” variabl^ that along with A 
controls the proximity between a and /3. The ADMM procedure leads to the following update 
sequence: 


Pk+i Gargmin £a(/ 3, ctfc; i^fc) 

(11) 

Qfc+I G argmin CxiPk+i^ot'^t^k) 

iy-Xa)\\^<S 

(12) 

^k+1 =^k + {pLk+l — fik+\) ■ 

(13) 


Step © can be performed via a hard thresholding operation [18] and ( |12[ ) involves a simple 
projection onto the polyhedron {ct : ||X''^(y — Xq:)||oo < <5}, which can be done efficiently, as 
detailed in Section B.lj (Appendix). 


Algorithm 1 

(1.) Input {Pi,ai,ui), choose A > 0, and repeat the following steps until convergence. 

(2.) Update (/3fc, Qfc, i/fc) to (/3fc+i, a^+i, i^fc+i) via 

(3.) Stop if \\Pk+i-Pkh < TiWPkh anc^||/3fc -Q:fc ||2 < r 2 max{||/3fc||2, Hafclb}, otherwise go to 
Step 2. 


We found Algorithm 1 to work quite well in our experiments. The algorithm may be sensitive to 
the choice of A — affecting the solution and the time until convergence. We recommend using 


multiple values of A, and choosing the best solution among them. In Section 4.1.3 we address 
these shortcomings and describe modifications that lead to improvements in practice. 


■^Following the terminology in m, if instead of ||/9||o we had a convex function, then v would be a dual variable, 
and its corresponding update step (131 would be the dual update. We will with a slight abuse of terminology use 
the term “dual” here. 

®Here, ti,T 2 are tolerances for convergence, typically, taken to be equal and set to 10“'*. 
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4.1.2 Sequential Linear Optimization 


We now describe another nonlinear optimization algorithm for obtaining upper bounds for Prob¬ 
lem ([^, motivated by ideas popularly used in nonconvex penalized regression (see, for example, 
[35j and references therein). Let us consider a family of nonconvex functions, p^{\l3\), parametrized 
by 7 G ( 0 , 7 ], such that 7 = 7 corresponds to P'y{\j3\) = |/3|, and, as 7 decreases to 0 , P'y{\l3\) 
becomes a progressively better approximation to l(/3 7 ^ 0). In other words. 


ll/3||o = lim P 7 (|A|). 

7 ^ 0 + 

2 = 1 


(14) 


We make the following assumption about P'y{-): 

Assumption (A): p^{(3) is symmetric in (5 around zero and continuous. Let P'y{P) > 0 and 
/?-y(0) = 0. For every 7 , the map \ j3\ 1 —>■ P'y{\l3\) is concave and differentiable on (0, 00 ). 

Some popular choices of p^{-) are p^{t) = log(^-|-1)/log(^-|-1) and p^{t) = t'*' for t > 0. We refer 
the reader to |35j (and references therein) for more context and examples of nonconvex penalty 
functions used in sparse linear regression. 

We propose to compute good upper bounds for the following continuous nonconvex optimization 
problem: 


min /i(/3) := ^/ 9 ^(| 7 |) 

^ i=l 


(15) 


s.t. 


xT(y-X/3)||oo <5, 


especially for 7 ss 0-|-. In light of (14), this leads to good upper bounds for Problem ([^. Note 
that the concavity of \j3\ 1 —)• P'y{\l3\) leads to the following upper bound (for all j3 and /3): 


HP) = YpH\H\) 

2=1 

<Ep7ifti)+E7(ifti)(ifti-ifti). 


(16) 


2=1 


2 = 1 


:=h{(3-,(3) 


where p'^{-) denotes the derivative of |/3| 1 —>■ p^{\f3\), with the convention that p'^{0) = 00 if the 
derivative is unbounded as \ j3\ —)• 0-|-. Inequality (16) suggests that we sequentially minimize an 
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upper bound to the objective function in (15). This leads to the following iterative scheme: 




argmin E 

/3 i=l 

S.t. ||XT(y-X/3)||oo <<5, 


(17) 


where we assume, without loss of generality, that (3^ is feasible for Problem ( |17[ ). The above 
sequential approximation of the function h{(3) is similar to the popular reweighted t'l-minimization 
method, used in signal processing m for sparse linear model estimation. 


We now present a simple finite time convergence rate of the iterative process 0 in terms of 
reaching an approximate first order stationary point of Problem (15). Towards this end, we 
introduce the following quantity: 


A(0) := 


min E (lAI - l^il) 

^ 2=1 

s.t. ||XT(y-X/3)||oo <5, 


(18) 


which we use to define a first-order stationary point for Problem (15). 

Definition 2. Suppose 6 is feasible for Problem (15). We say that 0 is a first-order stationary 


point for Problem (15), if A{6) = 0. For > 0, 6 is said to be an accurate first-order stationary 


point if A{6) > — fi. 

We note that A(0) (assuming that 6 is feasible for Problem (|15[)) is a measure of how far 6 is 


from a first order stationary point of Problem (15)-if A(0) < 0, then the current estimate 9 can 


be improved, if A(0) = 0 then the solution cannot be improved via update 0- We refer the 


reader to Section B.4 for a more detailed explanation. The following theorem (for a proof see 
Section B.4) presents a finite time convergence rate analysis of the sequence 0 to a first-order 


stationary point for Problem (15). 

Theorem 5. Consider Problem 0 for a fixed 7 > 0 , with Assumption (A) on Py{-) in place. 
The update sequence (3^, defined via 0 , leads to a decreasing sequence of objective values for 
Problem (15).' /i(/3^“'“^) < h{(3^) for all k >1. In addition, for every /C > 0 we have the following 


finite-time convergence rate: 


mm 

i<k<K: 


{-A(/3^)}<i(h(/3i)-h), 


where the sequence of objective function values satisfies h{P^) f h as k ^ 00 . 


We emphasize that the result in Theorem pertains to the performance of the sequence ( |17| ) as 
a numerical optimization scheme, and has no direct implication on the statistical properties of 
the sequence. Theorem implies that for any iji > 0, it takes at most JC = O(^) many iterations 
to reach a ^-accurate first-order stationary point, i.e., there exists a 1 < k* < 1C such that 
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A(/3'= ) > —(j). The sequence /3^ leads to an estimate (3^, an upper bound for Problem (flM for a 


fixed 7 . Since our intent is to obtain a good solution to Problem Q, we make use of property (14). 
This suggests that we obtain a good upper bound to Problem (15) for a small value of 7 « 0 +. 


Instead of applying iteration (0 for a pre-specified (small) value of 7 , we recommend using a 
continuation strategy in practice. We take a sequence of decreasing values of 7 G { 71 ,... , 77 v}, 
where 7 * > 7 i+i. We use f3^. as a warm-start for obtaining a good solution (upper bound) to 


Problem (15) for a smaller value of 7 = 7 i+i. In our numerical experiments, this continuation 
strategy seems to work well. The method is summarized below. 

Algorithm 2 

(1.) Take a decreasing sequence of 7 values { 71 ,..., 7Ar}; initialize with = 0; and fix a value 
of Tol = 10“^ (say). Set k = 1 and 7 = 7 ^. 

(2.) Use the update sequence rule ( [Tt] ) until some convergence criterion is met: (—A(/3^)) < Tol. 
Let (3^ denote the estimate of (3, upon convergence. 

(3.) Set K ■<— K + 1, 7 = 7 k and j3^ (3^. li k, < N, then goto Step 2. If k > N, exit with (3^ 

as an upper bound to Problem (§. 

The linear optimization Problem 0 can be solved quite efficiently using simplex methods. For 
larger problems, i.e. p larger than a few thousand, we recommend using modern first-order 
method as described in Section |B.2 Since Algorithm 2 requires solving several instances of 
related problems of the form 0, the warm-start capabilities of simplex methods and first-order 
methods lead to computational benefits. 


4.1.3 Algorithm 3: Combining the Strengths of Algorithm 1 and Algorithm 2 


In our empirical studies we observed that Algorithm 1 is more effective in obtaining good upper 
bounds than Algorithm 2 for a given time limit. Algorithm 2 , on the other hand, has stronger 
convergence guarantees than Algorithm I. Algorithm 1 leads to an estimate of (3 that is sparse 
but approximately satishe^the feasibility constraint of Problem ([^. Algorithm 2, in contrast, 
leads to solutions that are both sparse and feasible — these advantages make Algorithm 2 an 
important tool in our framework. We propose to combine the best features of Algorithms 1 and 
2 to develop a hybrid variant: Algorithm 3, which we recommend to use in practice. Algorithm 3 
is simple but very effective: it uses the solution obtained from Algorithm I, say, /3^ \ to create a 
set X C {1,... ,p}, which includes the nonzeros in \ and then applies Algorithm 2 on this set 


the details of this method are presented in Section B.5 (in the Appendix). 


®This is because Algorithm 1 delivers a pair, a, (3, which are approximately equal: a « /3; a is feasible for 
Problem § but need not be exactly sparse; /3, on the other hand, is sparse but approximately feasible. 
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5 Structured MILO Formulations and Certificates of Optimality 


This section is dedicated to enhancements of the basic Discrete Dantzig Selector formulation Q, 
presented in Section [2.2[ These are particularly useful in delivering tighter lower bounds, thereby 
providing certificates of global optimality in shorter times. 

Note that formulation Q requires the specification of Mu large enough to include the solution 
of Discrete Dantzig Selector. We mention another MILO formulation for Problem Q, based on 
Specially Ordered Sets [7]. We introduce binary variables Zi G {0,1}, which satisfy the condition 
(1 — Zi)l3i = 0 for alH = 1 ,... — in other words, if Zi = 0, then (3i = 0, and if Zi = 1, then Pi is 

unconstrained. This condition can be modeled via integer optimization using Specially Ordered 
Sets of Type 1 (SOS-1). More specifically, 


{l-Zi)Pi = 0 (/3j, 1 - Zj) : SOS-1, 


for every i = 1,... ,p. This leads to the following MILO formulation for Problem ([^: 


p 


min 



/3,z 

2 _ 


S.t. 

-5 <dj - <5 j = l,. 

■ ■ ,P 


{Pj,l- Zj) : SOS-1 j = l,. 

■ ■ ,P 


Zj£ {0,1} j = l,-' 

■ ■ ,P, 


(19) 


where, we use the notation as used in Problem Q. Observe that unlike ([^, Problem (19) 


does not contain any parameter Mu in its formulation. Problem (19) may be preferred over 
Problem (1^ when the different nonzero values of |/3i|’s have widely different amplitudes. In 


general, however, we found empirically that the algorithmic performances of formulations (19) 


and (j^ are comparable. The MILO formulations (|^ and (19) are found to work quite well in 
obtaining good upper bounds for up to p = 10,000, once they are warm-started via the discrete 


hrst-order methods described in Section 4.1 If additional problem-specific information which 


we refer to as “intelligence” is supplied to the MILO formulations (^6^ and (19), the results are 


found to improve substantially — as shown in Section 6.2 More specifically, we use the term 
“intelligence” to broadly refer to two components: 

(a) providing an advanced warm-start to the MILO solver, obtained via our discrete first-order 
methods 


(b) arming the MILO solver with information in the form of interval bounds on the regression 
coefficients Pj, predictions xI/3, and also bounds on ||/3||i and ||X/3||i. 

We note that the resulting formulation with the additional bounds as suggested in (b), above, 
should lead to a solution for Problem ([^. We, thus, present the following structured version of 
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formulation Q: 


V 


min 

/3,z 

i=l 




S.t. 

— 6 < dj — < 6 

3 = 1; 

,...,p 

(20a) 


— M{jZj < Pj < M{jZj 

3 = 1; 

,... ,p 

(20b) 


Zj G {0,1} 

3 = 1: 

,... ,p 



II 



(20c) 


-M{j<Pj<M\j 

3 = 1; 

,...,p 

(20d) 


- <ii< M\f 

i = 1, 

...,n 

(20e) 







n 

i=l 



(20f) 


where the optimization variables are f3 G MP, z G {0,1}^, ^ G MP, and the parameters A4^ 

control, respectively, upper bounds on |/3j|, |(xj,/3)|, ||/9||i and ||X/3||i. We note that the parame¬ 
ter Mu in Problem Q is such that Aiu > maxj for j = 1,... ,p. Problem (20) is equivalent 
to the following constrained version of Problem (j^: 


nun ||/3||o 

S.t. ||XT(y-X/3)||oo <5 

-M^jj < I3j < M\j j = l,...,p 

|(xi,/3)| < z = 1,... ,n 


Section ^ presents several strategies to compute these parameters such that a solution to Prob¬ 
lem (21) is also a solution to Problem ([^. 


We present a few variations of formulation (20) that might be preferred from a computational 


viewpoint, depending upon the problem instance under consideration. For large values of p and 


n (approximately a few thousand), the constraints appearing in (20a) and (20c) may be replaced 
by: 

-5 < dj - (xj, <6, ^ = X/3, 

where, X^X = X^X and X is triangular — this leads to a sparse representation of the constraints 


appearing in (20). When n is large and p is smaller, it may be useful to perform a variable 


reduction by removing the variable ^ from (20). This will replace constraint (20a) by — 5 < 
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~ constraints (20c), (20e) and (20f) will be dropped. Constraints (20b) imply 


the bounds indicated in the constraints (20d), hence the constraints (20d) may be dropped in 


favor of a formulation with fewer constraints. 


We note that formulation (20) is an optimization problem with many more continuous variables 


than formulation (§. This implies that the MILO solver needs to do more work at every node, 
by solving larger convex linear programs. However, the advantage is that the resulting formu¬ 
lation is more structured, and, thus, tighter lower bounds may be obtained by exploring fewer 


nodes. Section 6.2 presents some computational results illustrating the performance of the above 
framework. 


5.1 Specification of Parameters 


We present herein, several data-driven ways to compute the parameters in formulation (20). The 


methods presented here are quite different from those proposed in [8], where, the authors rely 
crucially on being able to compute analytic expressions for least squares solutions for a given 
subset size—such expressions are not available for Problem (§. 

5.1.1 Specification of Parameters via Linear Optimization 

We present several methods based on linear optimization that can be used to estimate the pa¬ 


rameters appearing in Problem (20), in such a way that these estimates lead to /3, a solution to 
Problem ([^. 

Bounds on /3i’s. Consider the following pair of linear optimization problems: 


s.t. 

■= 

s.t. 


max Bi 


XT(y-X/3)||oo <(i, 


( 22 ) 


min Bi 


XT(y-X/3)||oo <<5, 


for i = 1,... ,p. Note that and fi- provide upper and lower bounds on /3j for every i. fif is 


typically a strict upper bound to /3j, because (22) does not account for the fact that solutions 


to Problem © are sparse. Similarly, is a lower bound to /3i, and it is easy to see that 
M\j = max{|/r^|, |/i^ |} is an upper bound to \/3i\. Note that solutions to Problem ([22l) are finite 


only if the feasible set is bounded. If n > p and if the entries of X are drawn from a continuous 
probability measure, then the bounds are finite with probability one. The above bounds can 
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be made tighter by using information about upper bounds on Problem Q as obtained via the 


discrete first-order methods. We describe such methods in Section B.6 (Appendix). Once upper 
bounds on |/3j|, i.e. A4lr, are obtained, they can be used to compute bounds on ||/3||oo and ||/3||i 
as follows: 


do 


\\P\\oa<Mu= max M]j and ||3l|i < 


(*) 
u > 


where, oq denotes an upper bound to Problem (§ and > Mf > ...>M 


u ■ 


Bounds on (xj,/3)’s. Bounds on (xj,/3) can be obtained by solving the following pair of linear 
optimization problems: 


vf (ao) := 

s.t. 


Vi (ao) := 

s.t. 


max {xi,d) 

|XT(y-X/3)||oo <5 
||/3||oo < 

||/3||i < MuotQ 

nnn (xj,/3) 
|XT(y-X/3)||oo <5 
||/3||oo < 

||/3||i < Muoto, 


(23) 


for every i = 1,..., n. 


Analogous to the bounds derived via Problem (|22|), it is also possible to compute more conservative 


bounds on (xj,/3), by dropping the constraints ||/3||oo < -Mu and ||/3||i < Muoto in Problem (23). 


This gives nontrivial bounds even for the under-determined n < p case, as long as X has rank 


n (this is in contrast with the bounds from Problem (22) being vacuous when n < p). It is also 
possible to estimate bounds on (xj, /3) by including an additional constraint: ||X/3||oo < AI^, and 


using an iterative method as described in the Appendix, Section B.6 (see Step-l-Step-4) while 
computing bounds on the regression coefficients. 

The quantity Vj = max{u)^(Q;o )5 ~'?^)~(ao)} provides an upper bound to |(xj,/3)|. In particular, 
this leads to the following upper bounds: 


||x3||oo < max Vi and ||x3||i < V'vj, 


2 = 1 


leading to a data-driven method to estimate bounds appearing in (20). 


Computational Cost. Computing the quantities appearing in (22), (37) and (23) requires solv- 
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ing at least 2{p + n) linear optimization problems. However, these individual problems are quite 


simple to parallelize and they need to be solved once, before proceeding to solve Problem (20). 


These linear optimization problems can be solved by simplex based solvers quite easily for p in 
the lower thousands (typically less than a minute with Gurobi’s simplex solver). 

5.1.2 Specification of Parameters from warm-starts 


We present herein, simple practical methods to compute the parameter values by using good 

upper bounds to Problem (llj) . Let /3 denote a solution that corresponds to a good upper bound 

^0 '—' 

to Problem (|^. {3 can be obtained from Algorithm 3, for example. One can also use the 


solution obtained from Algorithm 3 as a warm-start to Problem (19) and allow it to run for 
a few minutes — the resulting estimate may be used as (3 . The parameters appearing in the 
bounds can be based on /3 , as follows. To be on the conservative side, we recommend setting 
the same bound for all the Af^’s: for example, they can all be assigned the value t||/3 ||oo. 


Similarly, a conservative bound for all the Af^*’s is given by r||X/3 ||oo- In addition, we can set 


0 , 


Alf = min|r||/3 ||o||/9 ||oo5'r||/3 ||i| and M.\ = r||X/3 ||oo, for some value of r G {1.5,2}. 

The method described above leads to parameter specific bounds as a simple by-product of our 


general algorithmic framework. Unlike the methods in Section 5.1.1, it requires no additional 


computation. On the other hand, the bounds in Section 5.1.1 are conservative, because they are 
implied by the bounds from Problem ([^. 


6 Numerical Experiments: Algorithmic Performance 


In this section, we report extensive numerical experiments that demonstrate: (a) the usefulness of 
the discrete first-order methods (Section |4.1[) in obtaining good quality upper bounds, especially 


when they are used to provide warm-starts to MILO solvers — this is shown in Section 6.1, and 


(b) how advanced warm-starts, coupled with the enhanced formulations presented in Section 5.1 


can be used to improve the overall run-time for off-the-shelf MILO solvers, when proving global 
optimality for the Discrete Dantzig Selector problem — this is shown in Section 6.2 


All computations were carried out on Columbia University’s high performance computing (HPC) 
facility, http://hpc.cc.coIumbia.edu/, on the Yeti cluster computing environment. The dis¬ 
crete first-order methods were implemented in Matlab 2014a, and we used Gurobi [25] version 


6.0.3. For all experiments in Sections 6.1 and |6.2| (except the large scale examples) we used 16GB 
of memory. 
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Type-1 (n = 100,p = 1000) Type-2 (n = 300,p = 1000) 


Data Fidelity 

Time 

Quality of Upper Bounds 

Parameter 

(in secs) 

With Warm 

Vanilla 


50 (*) 

0 

- 

1.55 

120 

0 

214.28 


500 

0 

14.28 


120 

6.66 

146.66 

5 

132(*) 

0 

73.33 


500 

0 

0 


35 (*) 

0 

- 

0.55 

120 

0 

29.62 


500 

0 

25.92 


40 (*) 

0 

- 

0.25 

120 

0 

73.44 


500 

0 

23.43 

Type-4 (n = 

58,p = 2000) 


Data Fidelity 

Time 

Quality of Upper Bounds 

Parameter 

(in secs) 

With Warm 

Vanilla 


300 (*) 

0 

220 

1.2S 

370 

0 

0 


600 

0 

0 


300 

16.66 

- 

5 

367 (*) 

0 

216.66 


600 

0 

0 


300 

5 

- 

0.55 

560 (*) 

0 

95 


600 

0 

95 


145 (*) 

0 

- 

0.25 

300 

0 

165 


600 

0 

60 


Data Fidelity 

Time 

Quality of Upper Bounds 

Parameter 

(in secs) 

With Warm 

Vanilla 


2(*) 

0 

- 

1.55 

120 

0 

0 


500 

0 

0 


90 (*) 

0 

42.85 

5 

120 

0 

42.85 


500 

0 

28.57 


57 (*) 

0 

200 

0.55 

120 

0 

86.66 


500 

0 

26.66 


120 

3.12 

25 

0.25 

210 (*) 

0 

15.62 


500 

0 

15.62 

Type-3 (n = 

600,p = 2000) 


Data Fidelity 

Time 

Quality of Upper Bounds 

Parameter 

(in secs) 

With Warm 

Vanilla 


500 

7.14 

- 

1.55 

530 (*) 

0 

- 


950 

0 

28.57 


500 

11.11 

- 

5 

875 (*) 

0 

137.03 


950 

0 

33.33 


55 (*) 

0 

- 

0.55 

500 

0 

- 


950 

0 

120.37 


500 

0.8 

- 

0.25 

560 (*) 

0 

- 


950 

0 

77.6 


Table 1: Tables showing “Quality of Upper Bounds”, defined as 100 x (haig — h)/h, where h^ig refers to 
the objective value obtained by algorithm “alg” (at the given time), and h is the best objective value found 
in the entire run-time duration of all the algorithms. Two cases of “alg” S {“With Warm”, “Vanilla”} 
have been considered: “With Warm” denotes MILD warm-started with a solution from Algorithm 3 (the 
structured formulations of Section are not used here); “Vanilla” denotes a MILD solver without any 
warm-start specification. “With Warm” is found to obtain the best upper bound for a given computation 
time-limit in all the instances. In several instances, MILD is found to improve the solution obtained via 
Algorithm 3, after accepting it as a warm-start. For Type-1,2 the total time limit was 500 secs; for Type-3 
it was 950 secs, and for Type-4 the algorithms were considered for a total time limit of 600 secs. For method 
“With Warm”, the times reported show the overall time taken by Algorithm 3 and the MILD algorithm. 
An asterisk “(*)” indicates that the best solution is obtained at that time. A “-” means that no feasible 
solution was obtained by the algorithm in that time. 
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6.1 Obtaining Good Quality Upper Bounds 

From a practical viewpoint, being able to obtain good quality upper bounds to Problem Q is, 
perhaps, of foremost importance. To demonstrate the effectiveness of our computational frame¬ 
work in this regard, we perform a series of experiments on the data-types described below. 

Type-Synth: We generate a Gaussian ensemble xp ~ MVN(0, S), where aij = pi* -^1 for some 
value of p G [0,1), with the convention that 0^ = 1. The underlying true regression coefficient 
vector, (3* G has /3t = 1 for k* equi-spaced values of j G {l,...,p} and = 0 for the 
remaining values of j. 

Type-1: This is of Type-Synth with n = 100, p = 1000, p = 0, k* = 10. We studied 
Problem Q for four different values of the parameter 6 set at 5(1.5,1, 0.5,0.2) with 6 being 
defined below. 

Type-2 : This is of Type-Synth with n = 300, p = 1000, p = 0.8, k* = 25. Here 5 values were 
set as 5(1.5,1,0.5,0.2). 

Type-3: This is of Type-Synth with n = 600, p = 2000, p = 0.8, k* = 40. Here 5 values were 
set as 5(1.5,1,0.5,0.2). 

Type-4 : This is a semi-synthetic dataset: we considered the Radiation sensitivity gene ex¬ 
pression datasef0 from Ch. 16 of the book [26]. The features were randomly downsampled to 
p = 2000 and there were n = 58 observations. We generated response y based on a linear 
model with ||/3*||o = 10, j3* = 1 for j < 10, and = 0 for j > 10. Here 5 values were set as 
5(1.2,1,0.5,0.2). 

Type-5 : This is of Type-Synth with n = 1000, p = 3000, p = 0, k* = 10; we considered one 
value of 5, which was set to 5. 

In each of the above examples, after X was generated, we standardized its columns to have zero 
mean and unit .^ 2 -iiorm. Then, the response was generated as y = X/3* -|- e, where e* ~ 1V(0, u^), 
and cr^ was adjusted to match the selected value of SNR= Var(x''~/3*)/cj^ (taken as 3 in all the 
above cases); the reference value of the tuning parameter was set to 5 = ||X'''(y — X/3*)||oo. 

We studied different first-order algorithms described in Section Algorithm 3 was empirically 
seen to have the best performance over its constituents. Algorithms 1 and 2, when used separately. 
Hence, we used Algorithm 3 in all the experiments to obtain good upper bounds to Problem Q. 
The solution obtained from Algorithm 3 was passed as a warm-start to the MILO formulation Q 
(for a large value of Aiu = 10^) — this hybrid MILO approach is denoted by “With Warm” in 
Table We compared this method with the vanilla MILO formulation Q (for a large value of 
Aiu = 10^), which was implemented without any warm-start information. Table shows the 

^We downloaded the dataset from the website http://statweb.stanford.edu/~tibs/ElemStatLearn/ 
datasets/ 
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objective values obtained by these two methods — the MILD algorithm aided with advanced warm- 
starts was found to perform the best across all the examples. For the hybrid approach (“With 
Warm”), in many of the instances, the solution obtained by Algorithm 3 was further improved by 
MILO. In some cases, the vanilla MILO approach took a while before it was able to find a feasible 
solution. For example, in the Type-5 setting (which does not appear in Table the best solution 
was delivered by Algorithm 3 within one minute; in contrast, the vanilla MILO algorithm failed to 
find a feasible solution within 1000 seconds. 


Diabetes Dataset (n = 442, _p = 64) 

I|3||o = 31 I|3||o = 14 ll3llo = r 





Time (secs) Time (secs) Time (secs) 

Figure 3: The evolution of MILD Optimality gaps (defined in Figure]^ as functions of time (in secs) for 
the MILO methods with and without problem-specific information. We consider three different values of 
the data fidelity parameter <5, leading to solutions with different number of nonzeros as mentioned in the 
figure panels. Here, “With Intelligence” refers to a MILO algorithm aided with an advanced warm-start, in 
addition to the bounds specified in Section and “Vanilla” refers to a MILO algorithm without any such 
additional information. MILO is found to benefit significantly from additional problem-specific information. 


6.2 Lower Bounds and Certificates of Optimality 


Here we demonstrate how our framework delivers certifiably optimal solutions to Problem ([^. 
In our first series of experiments we considered the popular diabetes dataset [19], which we 
examined with interaction terms included, giving us p = 64 and n = 442. All the features and 
the response were mean-centered and standardized to have unit f’ 2 -norm. Figure shows the 
performance of two versions of MILD - “With Intelligence” and “Vanilla”. “With Intelligence” 


refers to MILO formulation (20), where a MILO solver is provided with an advanced warm-start, 
;0 


say, (3 . The parameter specifications are obtained based on the method in Section 


5.1.2 


here, we 


used the box constraints and £i-constraint on (3. The “Vanilla” version of MILO was not provided 
with any such problem-specific information — we used formulation ([^, as in Section 6.1 Our 
experimental results (Figure show that “Intelligence” significantly enhances the performance 
of the MILO solver, in terms of proving global optimality. Usually, we observe that for a fixed n,p 
with n > p the time to certify optimality is smaller when k is small or close to p ~ intuitively, this 
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ll/3|Io) = (400,1000.20) (n,p, ||/3|to) = (600,1500,15) 


(ri,p. ||/3|[o) = (750,2000,20) 


(n,p, |[/3||o) = (900,2500,20) 




With Intelligence 



600 800 1000 1200 1400 1600 


Time (secs) 


Time (secs) 


Time (secs) 


Time (secs) 


Figure 4: The evolution of MILD Optimality gaps (defined in Figure]^ as a function of time (after a warm- 
start was supplied to it) for different synthetic examples with varying problem-sizes. “With Intelligence” 
refers to MILO aided with an advanced warm-start information, in addition to the bounds specified in 
Sectionj^ as described in the text. The “Vanilla” version of the MILO was found to perform worse, and is, 
thus, not shown in the figures. In all these instances, the £i-Dantzig Selector resulted in denser solutions. 


is due to the “search-space” being small. The computation time increases as k becomes closer to 
p/2. This is reflected in Figures and 


6.2.1 Moderate Scale Examples 


We considered some examples of Type-Synth for p = 0 and different values of n,p, k*] in all the 
examples, we set 5 = ||X'''(Y — X/3*)||oo- The results for MILD with intelligence are displayed 
in Figure We obtained an advanced warm-start (/3 ) from a combination of Algorithm 3 and 
MILO formulation (19), where the latter was allowed to run for an overall time limit of 500 seconds. 
The warm start (3 was used to initialize formulation (20) — the parameter specifications in the 
formulation were obtained based on the method in Section 5.1.2 We also experimented with the 


version of formulation (20) that considers only box constraints on /3; the results were often found 
to be roughly similar — both methods certified optimality, though there were some differences 
in the total run-time (roughly around a few minutes). For all the synthetic examples presented 
in Figure the vanilla version of MILO took much longer to prove optimality and hence they are 
not shown in Figure 


In all these instances, the £i-Dantzig Selector, not surprisingly, resulted in a solution that was 
more dense than the corresponding Discrete Dantzig Selector. 
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n 

p 

k* 

(Synthetic Examples) 
Upper Bound Lower Bound 

MILD Gap 

Time (hrs) 

2,000 

5,000 

30 

30 

30 

0 

8.3 

4,000 

5,000 

60 

60 

60 

0 

20.0 

7,000 

7,000 

20 

20 

20 

0 

21.7 

6,000 

7,000 

60 

60 

60 

0 

20.0 

4,000 

8,000 

20 

20 

20 

0 

41.9 

3,000 

8,000 

20 

20 

20 

0 

18.3 

3,000 

9,000 

20 

20 

20 

0 

47.2 

4,000 

9,000 

20 

20 

20 

0 

44.4 

1,000 

10,000 

10 

10 

10 

0 

14.2 

5,000 

10,000 

10 

10 

10 

0 

2.5 

5,000 

10,000 

20 

20 

20 

0 

47.5 

10,000 

10,000 

30 

30 

27 

10% 

42.5 

n 

p 

k* 

(Real Data Examples) 
Upper Bound Lower Bound 

MILO Gap 

Time (lirs) 

6,000 

4,500 

20 

20 

20 

0 

5.0 

6,000 

4,500 

40 

40 

37 

10% 

12.5 


Table 2: Table showing times taken to reach global optimal solutions for several problem instances (both 
synthetic and real data), with p up to 10,000. For each instance, we list the upper bound, the lower 
bound, the corresponding MILD (Optimality) Gap and the time taken to reach the listed lower bound by 
the MILD solver equipped with “Intelligence” (as described in Figure]^. The times (in hours) refer to 
those taken by the MILD solver after being provided with a warm-start. In all the instances, the best 
upper bound was obtained around approximately one hour, however, it took much longer to obtain a 
certificate of global optimality via the (almost) matching lower bounds. The MILD Gap was found to be 
zero in all the instances apart from the two where the algorithm was terminated upon obtaining a lower 
bound within 10% of the upper bound. The results demonstrate that certifiably optimal Discrete Dantzig 
Selector solutions can be obtained for large scale instances. 


6.2.2 Large Scale Examples 


We consider several large scale examples with p ranging in 4,500 to 10,000: these problem-sizes 
are orders of magnitude greater than those considered in [8] . These computations were performed 
with 100 GB memory. 


We studied a host of synthetic examples, all generated as in Section 6.2.1 We also considered a 


semi-synthetic dataset derived from the well-known Gisette data http: //archive. ics. uci . edu/ 
ml/datasets/Gisette, Here, we generated a response y, based on the Gisette data covariates 
(each feature was standardized to have zero mean and unit £2 norm) — here, n = 6000 and 
p = 4500, we set /3* = l,i = 1,..., k* and the remaining /3* = 0; SNR=3 and considered two 
instances with k* = 20,30. 


The algorithmic set-up was similar to that used in Section 6.2.1 The results are presented in 
Table In all these instances, the £i-Dantzig Selector resulted in a solution that was more 
dense than those obtained via the Discrete Dantzig Selector. For all the synthetic examples, 
the MILO solver delivered solutions that matched the optimal solution of the data generating 
mechanism. Typically, the time taken to prove optimality marginally increases with larger values 
of k* (for a fixed re,p); for a fixed k*,p the times taken to certify optimality increases with 
decreasing values of n. The examples demonstrated in this paper, show the largest instances 
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of discrete optimization problems for exact variable selection, that can be solved to provable 
optimality. 


7 Numerical Experiments: Statistical Properties 

We conducted a series of synthetic experiments to understand the statistical properties of the Dis¬ 
crete Dantzig Selector and compare them to those of the £i-Dantzig Selector and variants. 

We used the following datasets in our analysis. 

Example-A: This is of Type-Synth with n = 200, p = 500, /? = 0 and k* = 20. 

Example-B: This dataset was similar to the one taken in Example-A, but the amplitudes and 
signs of the true regression coefficients were allowed to vary: the twenty nonzero /?t’s were equally 
spaced in the interval [—10,10]. 

Example-C: This is of Type-Synth with n = 100, p = 500, p = 0.85, k* = 10. 

Example-D: We set n = 100,p = 300 and let X ~ MVN(0,5]), where ai 2 = 0-21 = 0.7, and 
all the remaining ajk are equal to zero. We also took (5\ = 1 and = —1, with the remaining 
coefficients set to zero, resulting in /c* = 2. (This example is a larger version of Example V 
described in Section and illustrated in Figure [^) 

In each of the above cases, after X was generated, we standardized its columns to have unit 
.^ 2 -norm. Then, the response was generated as y = X/3* + e, where e* ~ A(0,cr^), and cr^ was ad¬ 
justed to match the selected value of SNR, which was varied across {3,10} in the examples. 

We considered the following estimators in our analysis: 

• “Warm” — this method applies a heuristic strategy to obtain upper bounds to Problem Q . 
We usec0 Algorithm 2, described in Section 

• “LO-DS” — the solution obtained from “Warm” is taken as a warm-start to a MILO solver 
and subsequently allowed to run with a time limit of 4000 seconds. 

• “LO-DS-Pol” — this is a “polished” version of the Discrete Dantzig Selector estimator “LO- 
DS” and is obtained by performing a simple least squares fit on the support of the “LO-DS” 
estimate. 

• “Ll-DS” — this is the original .^i-Dantzig Selector. 

• “Ll-DS-Pol” — this is a polished version of the “Ll-DS”. 

®This is similar to a re-weighted ^i-minimization [15] method applied to Problem We took the penalty 
oc log(|/ 3|/7 + 1 ) on a geometrically decreasing grid of ten 7 values: 7 i = 10 “^ x 0 . 8 '“^ for i = 1 ,..., 10 . 


4.1.2 
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Each of the above estimators were computed on a range of approximately thirty different S values 
around S = ||X^(y - X^/3*)||oo. We considered ten different replications (based on different e 
realizations) and took the median of the results. The optimal tuning parameter for every 

model was selected based on the value of S that minimized the estimation error with respect to the 
true regression coefficients. For this chosen value of we considered different metrics to assess 
the performance of the different estimators. We computed the squared t' 2 -error in estimating the 
regression coefficients: ||/3 — /3*|||. We also considered the “Variable Selection error”, which is 
defined as 7 ^ S*), where Sj is the jth coordinate of S := Supp(/3), and S* is the jth 

element of S* = Supp(/3*). Finally, we computed the “number of nonzeros”, which refers to the 
number of nonzero coefficients in (3. 

A collection of representative results with SNR=10 is displayed in Figure The error bars 
correspond to standard errors, the width being set to 2s/y/N where, s is the mean absolute 
deviation around the median and N denotes the number of replicates (here, ten). A larger display 
of additional examples with varying SNR values is presented in Table in Appendix where we 
also report the “Prediction Error”, defined as ||X/3 — X/3*|||/||X/3* |||. In Table 0 given in the 
same section, we provide comparisons with the polished version of the £i-Dantzig Selector. Our 
experiments show that polishing the t'l-Dantzig Selector may lead to marginally better solutions 
relative to the original Dantzig Selector, but the corresponding statistical performance is inferior 
to that of the estimates based on the Discrete Dantzig Selector estimator. 

We note that the performance of the Lasso was found to be quite similar to that of the £i-Dantzig 
Selector. The statistical performance of the subset selection procedure (Q, as described in [S], was 
found to be similar to that of the Discrete Dantzig Selector for p < 1000. Because the main focus 
of the paper is to show that Discrete Dantzig Selector is a computationally tractable procedure, 
which delivers estimates with better statistical properties than its ii counterpart, we restrict our 
numerical studies to the methods listed above. 

Summary of Findings. Based on the experimental results, we observe that the Discrete Dantzig 
Selector and its polished variant perform quite well, when compared to the competing methods 
in terms of estimating f3*; they also demonstrate superior variable selection properties (not sur¬ 
prisingly, the £0 methods obtain the sparsest models across all the examples). “Warm” does not 
perform very well when compared to “LO-DS”, even though both methods attempt to solve Prob¬ 
lem Q — this suggests that estimators based on rigorous optimization procedures have better 
statistical properties. We observe that “LO-DS” and “LO-DS-Pol” possess similar variable selec¬ 
tion properties, however, the latter may lead to better estimators of f3* and X/3*, due to the least 
squares post-processing. Polishing of the fi-Dantzig Selector may not lead to better solutions, 
due to the weak variable selection properties of “Ll-DS”. In some cases, when the value of p is 
quite large and, consequently, the covariates are highly correlated (see Example-C, Example-D), 
the basic problem of variable selection becomes difficult: instead of choosing a “signal” variable, 
the “LO-DS” chooses its correlated surrogate. In these cases, as expected, we observe that the 
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Example-A 

(n = 200, p = 500) 


Example-C 

(n = 100, p = 500) 


Example-D 

(n = 100, p = 300) 



Figure 5: The statistical performance of the Discrete Dantzig Selector (LO-DS); its polished version, 
which does a least squares refitting on the support (LO-DS-Pol); the heuristic estimates delivered by 
Algorithm 2 (Warm); and the original £i-Dantzig Selector (Ll-DS). We display three different metrics: 
[top panel] squared £ 2 -error in estimating /3, [middle panel] the 0-1 variable selection error and [bottom 
panel] the number of nonzeros in the optimally selected model; the horizontal dotted line shows the number 
of nonzeros in the “true” model. We observe that the Discrete Dantzig Seleetor based approaches perform 
very well in terms of obtaining a model with high quality estimation and variable selection properties. Their 
models are substantially sparser than those for the £i-based methods. The heuristic approach, “Warm” 
which approximately optimizes Problem ([^, falls short in obtaining high quality statistical estimates. A 
number of additional experiments (with results similar to this figure), are presented in Section]^ of the 
Appendix. 
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Example 

Metric 

Time (secs) 

LO-DS-Pol 

MIPGO 

p = 0 

Variable Selection 

60 

0 

20 

n = 100 

Error 

200 

0 

14 

p = 200 


60 

0.309 

20.00 

k* = 20 

11/3-rili 

200 

0.309 

8.79 

II 

o 

Variable Selection 

60 

0 

10 

n = 100 

Error 

200 

0 

4 

p = 200 


60 

0.063 

0.271 

k* = 10 

11/3-/ 3 II 2 

200 

0.063 

0.076 

II 

o 

Variable Selection 

100 

0 

9 

n = 300 

Error 

500 

0 

5 

p = 600 


100 

0.094 

9.142 

k* = 20 

11/3-/3111 

500 

0.094 

1.548 

II 

o 

Variable Selection 

150 

0 

25 

n = 300 

Error 

1000 

0 

16 

p = 1000 


150 

0.175 

25.00 

k* = 25 

11/3-/31I1 

1000 

0.175 

14.893 


Table 3: In all the above instances, the generated data is of Type-Synth, with SNR=10. Both the 
Discrete Dantzig Selector and MIPGO (with the MCP penalty) methods were run on twenty different 
values of the tuning parameter, and the best solutions are reported. LO-DS-Pol refers to the least squares 
solution obtained on the variables selected by Discrete Dantzig Selector. For every value of the tuning 
parameter, each method was run for a time budget of ti,t 2 seconds with ti < t 2 specified in the “Time” 
column. The Discrete Dantzig Selector method reaches the best solution within ti seconds, much earlier 
than its competitor. The MIPGO method is seen to take orders of magnitude longer to get a solution of the 
same quality as the Discrete Dantzig Selector — the differences become more pronounced with increasing 
problem size. 


“LO-DS” incurs relatively large variable selection error-the prediction accuracy of these mod¬ 
els however, demonstrate a more optimistic picture than the variable selection properties (See 
Table |§. 

7.1 Comparisons with Least Squares Subset Selection 

We discuss some comparisons of our proposal with the recently proposed methods: Problem Q 
by [8] and MIPGO [33] , 

For underdetermined problems (with n < p) the authors in [8] (see Section 5.3.2 in 0) point 
out that MIQO solvers take a long time to certify optimality, by producing matching upper and 
lower bounds. For problems with n < 100 and p = 1000, [8] demonstrated how the MIQO methods 
for Problem (Q could certify local optimalitjj^ in a (small) bounding box around a candidate 
solution. We observed in our computational experiments that Problem ([^ is orders of magnitude 
faster than (|^ for underdetermined problems, in obtaining solutions with certificates of global 

®We note that certifying local optimality i.e., optimality in a neighborhood of a candidate solution is also an 
NP-hard problem. 
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optimality. On several randomly generated problem instances generated as per Type-Synth with 
n = 100, p = 200, p = 0, k* = W and SNR=10, Problem Q was solved to global optimality, 
i.e., zero optimality gap with a median time of about 4 minutes. On the same instances, the 
MIQO formulation for Problem Q took more than 7 hours of computation time to obtain similar 
optimality certificates. In addition, a MILO formulation for Problem ([^ consumes much less 
memory than a comparable MIQO formulation for Problem Q. For example, on problem instances 
with n = p G {1-5, 2, 2.5, 3} x 10^, we observed that Problem Q requires at least twice as much 
memory as that for Problem Q, within the first 800 seconds of computation time. The memory 
requirement for a MIQO for Problem Q with n = p = 3000 was more than 12GB. 

MIPGO |33j is a discrete optimization framework for minimizing a regularized version of the least 
squares loss, with a nonconvex quadratic penalty (for example, SGAD or MGP). This corresponds 
to a nonconvex quadratic optimization problem, which the authors express as a discrete linear 
optimization problem via linear complementary constraints [2Tj- This representation results in 
many more binary variables (several multiples of p) than that required for the Discrete Dantzig 
Selector. For example, in the case of the MGP penalty, the paper |33] presents a MILD with 4p 
binary variables and many more continuous variables. In particular, with n = 300 and p = 1000, 
the MIPGO solveip^ creates a problem with 38,000 variables and 27,000 equality constraints, 
which after presolve reduces to a problem with approximately 7,000 continuous and 4,000 binary 
variables. These optimization problems are substantially larger than ([^, which is a MILO with p 
continuous and as many binary variables. As a result, the MIPGO formulation seems to become 
computationally expensive as the dimensionality of the problem increases. This is illustrated in 
Table [^“ we show on several synthetic instances that, with a particular time budget, the Discrete 
Dantzig Selector formulation Q equipped with a warm-start from Algorithm 2, leads to better 
solutions than MIPGO. The quality of solutions produced by MIPGO is found to improve with 
more computation time, but the time taken can be substantially larger than that of the Discrete 
Dantzig Selector. The synthetic datasets for Tablewere generated the same way as in Section]^ 
We set the concavity parameter for the MGP penalty in the MIPGO code to its default value of 
a = 2. 
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A Proofs for Section 3 

Proof of Theorem Let e denote the projection of y onto the orthogonal complement to the 
space spanned by the predictor vectors Xj,j = 1, ...p. Note that 

l|y - X/3f = '^{Cj - I3jf + ||ef. 
i=i 

Thus, under the constraint ||/3||o < k, the smallest sum of squares is achieved by setting f3j = Cj 
for j = 1, k and jSj = 0 for j = A: + 1, This completes the proof of part 1. 

Note that 

|xJ(y-X/3)| = \cj- Pj\. 

Thus, the constraint ||X^(y — X/3)||oo < 5 is satisfied if and only if G [cj — 6,Cj + (i] for 
j = 1, ...,p. In order to minimize ||/3||o, coefficients /3j for which 0 G [cj — 6, Cj + d] are set to zero. 
Thus, I3j = 0 if and only if \cj\ < 5, which implies I3j = 0 for j > k. This completes the proof of 
part 2 . 


Proof of Theorem Throughout this proof we omit the words “with probability tending to 
one” to improve the presentation. The constraint ||X''"(y — X/3)||oo < S implies ||X''"X(/3* —/3) + 
X''~e||oo < Recall that 5 = and note that ||X~'^e||oo = Op(l), due to the scaling of the 

predictors and the assumptions on the e. Consequently, ||X^X(/3* —/3)||2 = Op{n^^^). Because 
X^X converges to an invertible matrix C, we conclude that there exists a Op(l) sequence of 
random variables bn, such that the bound 

11 ^- 1/23 _ ^- 1 / 2^*112 <5^ 

simultaneously holds for all the Discrete Dantzig solutions (3. Recall that (3* = n^/^3 i some 

~ 3(S 

fixed vector (3 . Consequently, ^ 0 implies / 0 for j = 1, ...,p. In other words the support 
of each Discrete Dantzig solution contains J*. It is only left to show that the cardinality of 
each such support cannot be greater than | J*|. Note that with probability tending to one, (3* is 
feasible for the Discrete Dantzig optimization problem. Indeed, 

||xT(y - X/3*)||oo = llX^elloo = Op(l), 

which is bounded above by 5, due to the assumption J —)■ 00 . Thus, inequality ||/3||o < | holds 
for each Dantzig Selector solution, which completes the proof of part 1. 
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In the paragraph above we deduced that the minimum value of the Discrete Dantzig objective 
function equals | J*|. We also showed that f3* is feasible for the Discrete Dantzig optimization 
problem. A similar argument establishes the feasibility of f3^. Consequently, /3* and (3^ are 
indeed Discrete Dantzig solutions, which completes the proof of part 2. 


Proof of Proposition!^ Consider an arbitrary nonzero 6 G MP, such that ||0||o < 2/c. Let Jq 
be the index set of the k largest, in magnitude, coordinates of 9. Observe that \\6 jc||i < II^Jolll 
and ||0||2 = ll^Joilb, because m> k. Thus 




110 


Joi Il2 


> K{k, Co, m), 


for Co > 1, which implies 'y{2k) > K{k, co,m). Also note that 


2 



> 


X0||| 

0Jo Hi 


> [K{k,co)f , 


for Co > 1, which gives j{2k) > K{k,co)/V2. 


Proof of Theorem and Corollary Note that X^e is a mean zero Gaussian vector, such 
that the variance of each component is Consequently, it follows from well-known maximal 
inequalities for Gaussian variables that the bound ||X^e||oo < S holds with probability at least 
1 — (p^t/tt logp)“^. The rest of the proof is conducted on the set where the above bound is valid. 
Note that on this set /3* is a feasible solution for the optimization problem Q, which implies 
11/3Ho < 11/3* Ho- Recall that we denote ||/3*||o by s* and derive the following inequalities: 


7(2s*)2 


< 


/3-/3^ 

= 0 - - (3 

< 

< 

-T, 


X(/3 - P* 


X’^X(3-/3*) 


3-/3* 



00 

] 


XT(y-X3) 

CX) 

+ n-i X 


/3-/3* 


Because both ||X^(y — X/3)||oo and ||X^e||oo are bounded above by 6, we derive 

2 


/3-/3* 




/3-/3* 


Applying inequality ||/3 — /3*||i < (2s*)^/^||/3 — P *\\2 to either the left or the right hand side of 
the above display yields the and the I 2 estimation bounds, respectively, in the statement of 
Theorem [3l 
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Finally, to establish the prediction error bound, observe the following inequality: 


X(/3 - (3* 


< 25 


l3-f3^ 


which is is a direct consequence of the two displays given above. We then complete the proof of 
Theorem by combining the above display with the inequalities 


3 -/ 3 * 


< (2s*)^/2 

3-/3* 

2 


X(3-/3*) 


Corollary follows directly from the ii estimation bound in Theorem 

Proof of Theorem 1^ We again focus on the set of high probability, where inequality ||X''^e||oo < 
6 holds. Because (3* is a feasible solution to the optimization problem Q, we have ||/3*||o > ^lb- 
Thus, ||/3||o < (1 + V’)||/3*||o- The rest of the proof is identical to the one for Theorem]^ with one 
exception: the bound ||/3||o < ||/3*||o contains an additional factor (1 + ijj). 


B Additional Algorithm Details and Proofs 


B.l Details on Algorithm 1 

Update 0 in the ADMM algorithm can be performed via a hard thresholding operation 
as it is of the form: 

3(A') := argmin \\(3 - c||| + A'||/3||o, 

/3 


(24) 


for an appropriately chosen X' = j and c = o: + jW, a solution is given by (3j{X') = Cjl{\cj\ > 
\/A),i = l,... ,p. The update step p2|) involves the following projection: 


^l|2 


min /(ct) := ||q: — c ||2 

CX. 

s.t. ||xT(y-Xa)||oo <(i, 


(25) 


where c = (3 — u/X. While the projection (25) can be computed by using standard quadratic 
programming methods, in our experience, we found theirp^to be quite time consuming for larger 
problems {p > 1000), especially because this projection needs to be computed for every iteration 


(indexed by k) of (11)-(13). Thus, we recommend using specialized hrst-order methods — these 
methods also naturally make use of warm-start information, which is particularly useful to us 
due to the iterative nature of the updates Unless X has uncorrelated columns, it is 


not straightforward to solve (25) in its primal form — we thus consider a dual of Problem (25), 


^Our reference is Gurobi’s quadratic programming solver. 
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for which we apply first-order methods for convex composite minimization [36]. To improve the 
flow of presentation, we relegate the description of a more general first-order method, which also 
applies to Problem (25), to Section B.3 in the Appendix. We repeat steps ([Ti])-(@ until an (ap¬ 


proximate) convergence criterion is met — see for example [341 |T2] for convergence results for the 
general method. We terminate the algorithm as soon as the successive changes in the updates 
become small and one has approximate primal feasibility (see Step (3) in Algorithm 1 ). 


B.2 Additional Details on Algorithm 2: Solving Problem ( [l7j ) 

Observe that Problem © is of the composite form |36] : 

min fi{0) + f 2 {O) s.t. 0 G C, (26) 

0 

where the function fi{9) is smooth, with its gradient Lipschitz continuous: ||V/i(0) — V/i(0^)|| < 
L\\6 — 0^11; f2{0) is nonsmooth and C is a convex set. In our specihc case, the smooth component 
is the zero function, f2{0) = C = {6 : ||X'''(y — X0)||oo < (i}- Thus, one may 

appeal to first-order optimization methods [21 EH ESI for composite function minimization. This 
requires solving, at every iteration, a problem of the form: 

Qm+I ^ argmin k\\0 -O'^Wl + E 

0 i=i (27) 

s.t. ||XT(y-X 0 )||oo <5, 


for some choice of L > 0 and 6 , and Wi = |/0^(|/3i|)|. If 0 then the above update sequence 

becomes identical to proximal gradient descent [2] . One may also use accelerated gradient descent 
methods, with a momentum term. We describe in Section B.3 first-order gradient methods that 
can be used to compute solutions to Problem (27). The sequence 0™, defined via (27), leads 
to the solution of Problem © as m —>■ oo, providing a 0 (^)-suboptimal solution in m many 
iterations if one uses standard proximal gradient descent methods; the convergence rate can be 
improved to O(^) if one uses the accelerated gradient descent version of the algorithm. 

Instead of choosing fi{0) = 0 one may also choose fi{9) = for a small value of r > 0. 

Interestingly, for small values of r the minimizer to Problem (26) is also a minimizer of the 
problem with the choice fi{9) = 0. This equivalence of solutions which holds true in much 
more generality is often known as exact regularization of convex programs in the mathematical 
programming literature — see for example [22] ■ Even if the two problems are not equivalent, the 
choice of fi{0) = §|| 0||2 always serves as an approximate solution to Problem (|I7|). With this 
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choice of fi{0), one needs to solve a problem of the form: 


min 

e 

s.t. 


I^lli + '^wi\di 


2=1 


|xT(y-X 6 >)||oo < <5. 


A solution to the above problem can be computed by considering its dual and applying (accel¬ 
erated) proximal gradient methods on the dual formulation, as described in Section B.3 In this 
approach, a two-stage iterative algorithm of the form (27) described above, is not required. 


Algorithm 2 suggests that we solve Problem repeatedly for different values of 7 — it turns 
out that the overall cost for solving all these problems is quite small. This is because (a) the 
problems do not change much across different values of 7 ; and (b) for a fixed 7 , while moving 
across different values of k, the linear optimization problems are quite similar since the weights 
|p(y(|/3j^|)| do not change much across k. Thus the solutions obtained from one linear optimization 
problem can be used as a warm-start to solve the next linear optimization problem. This is 
found to reduce the overall computation time. Both the first-order methods (described above) 
and simplex methods can gracefully take advantage of warm-starts. 


B.3 Dual Gradient Method 


Here we describe how to solve a problem of the form: 


min illo: — clln 

Q. ^ 


+ '^Wi\ai\ 

2=1 

s.t. IIAq — b||oo < S, 


(28) 


where we assume that wt > 0 and the set {a : ||Aq: — b||oo < <5} is nonempty. Note that the 


constraint set in (28) makes solving the primal form (28) challenging. However, due to the strong 


convexity of the objective a dual is smooth (has Lipschitz continuous gradient) and the non 
differentiability nicely separates across the dual variables - we thus use dual proximal gradient 


algorithms to optimize (28). This trick is often used in optimization and signal processing, see 
for example m- 


To derive a dual for Problem (28), we note that it can be written equivalently as: 

p 


min illo: — cUn 

a,C 2 


2=1 


s.t. IlCIloo < S, 

= Aq — b. 
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The minimum of the above problem can be obtained by maximizing a dual problem, obtained 
by dualizing the equality constraints C = Aq — b; this consequently leads to the following 
problem: 

g{fi) := min ( ^ ||q; - c||| + (/^, C - (Aa - b)) 

C,a:||CI|oo<<5 V 

P \ 

+ Wi\ai\). 

i=l ^ 


The above can be simplified to; 

5(/^) 


= mm I 2 net — 


IIIq: - c\\l - ||/i||i(5 - {fi, (Aq - b)) + '^Wi\ai\ 

=51 (/x) - 5||At||i, 


2 = 1 


where, 


gi{n) = nun I jHo: - cH^ - {fJ,, (Aq - b)) + '^Wi\ai\ 

V i=l 

Note that S, the unique minimizer of the above problem, is given by: 

p 


S=argmin |||q: — (c + 


Wi\ai\ 


2 = 1 


i.e., Oi =sgn(ci + ajfi) ■ max ||ci + aj- Wi, o| 


for i = 1,... ,p, where a* is the rth column of A. It follows from standard convex analysis m 
that the function i—)• gi{pd) is differentiable with its gradient given by: 

V5i(m) = -(Aq - b), 

and its gradient is Lipschitz continuous: 

||V5i(/x) - V5i(/x')|| < IIAIIIIIai - /x'll, 

where || A ||2 denotes the largest singular value of A and for a vector u, the term ||u|| denotes the 
usual ^ 2 -norm of u. 


By using standard quadratic programming duality theory m , the minimum of Problem ( |2^ can 
be obtained by maximizing the unconstrained dual problem 5 (/x) in the dual variable /x, which 
is equivalent to the following minimization problem: 


min - 5 (/x) = min(- 5 i(/x) + 5||/x||i). 
u p- 


(29) 
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This problem is of the composite form [36], and proximal gradient descent methods iniiiiiiEiEi 
apply to it directly. 


For the special case of Problem (25), the method described above applies with rcj = 0, i = 1,... ,p. 


Clearly, (29) is an -regularized quadratic program, with the primal dual relationship being: 


Q = c -I- fl. 


Note that Problem (28) needs to be solved several times during the course of Algorithm 1 and 


Algorithm 2, across the different iterations. Fortunately, these problems are not completely 
unrelated, in fact, they are quite “similar”. In Algorithm 2 the weights Wi change; and in 
Algorithm 1, the parameter c changes. Since the problems are similar, it is not unreasonable to 
expect that the optimal dual variables corresponding to these two problems do not change much. 


Thus, it is useful to initialize the dual variable for one instantiation of Problem (28) with the 


(dual) solution obtained from another instantiation of Problem (28). This simple strategy leads 
to substantial performance gains over solving the problems independent of one another. 

B.4 Proof of Theorem 

Proof. Note that the sequence f3^, defined via satisfies the following relationship: 

> nun h{p-,P^) s.t. ||X^(y - X/3) ||oo < <5 




Observing that 


we have: 




fc+i^ 


hiP'^) = h{p^]P^) > hiP'^+^-p'^) > h{P'^+^), 


fc+l. /3fc\ 




(30) 


and, thus, the sequence h{P^) is decreasing. Subtracting h{P^) from all sides of the above 
inequality, we derive: 


0 > /i(/3^+^;/3^) - hip'^) > h{p'^+^) - h{p'^). 




The first part of the above display gives us, using (16): 


0 >h{P^+^-,p^) - h{p^) 


i=l 

which means A(/3^) < 0 for all k. If A(/3^) < 0, then P^^^ leads to a strictly improved value 
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of the objective function. If A{(3^) = 0, then is a fixed point of the above update equation. 
Hence, A{f3^) is a measure of how far (3^ is from a first-order stationary point of Problem (15). 


The display in (30) shows that the objective values are decreasing, and, because the objective 
values are all bounded below (by zero), the decreasing sequence converges. 


In addition, we have that 

h{(3^) - > -A(/3^). 

Adding the above for A; = 1,..., /C, we have: 


> E {-A(/3")} 


K>k>l 

(31) 

> K, min |— A(/3*^)| , 
i<k<K '' ^ 


which leads to the following convergence rate: 


.SA{-‘"(X)}<4w/3Vm/5'=«)) 

(32) 

lA 

1 

(33) 


where (33) follows from (32) by using the observation that h{(3 ) j, h. 


□ 


B.5 Additional details on Algorithm 3 


We seek an upper bound to a simple variant of Problem (15): 


min h{/3) p^{\j3i\) 

^ i=l 

s.t. ||XT(y-X/3)||oo <<5 

/3j = 0, i E 


(34) 


where Supp(/3^ ^) := {i : ^ 0,i = 1,... ,p} Cl, and is the complement of X. We assume, 


of course, that the feasible set in Problem (34) is nonempty. A simple method for constructing 


X, which we found to be quite useful in practice, is presented below. Let B C {1,... ,p} and 
denote its complement. We define the following set: 


T{B) := {/3 : ||X’^(y - X/3)||oo <S,Pi = 0,i€ B'^] . 


43 








Let ^ be the solutions produced by Algorithm 1. Suppose we let B denote the support of 

the size of B is typically much smaller than p. If J-{B) is nonempty, we take I = B. Note, 
however, that B{B) may be empty, because dS^\f3 are only approximately equal: ss (3 . 

In this case, we need expand the set B, so that the set B{B) becomes nonempty. There may 
be several ways to do this, but we found the following simple method to be quite useful in our 
numerical experiments. 

I. If F{B) is empty, we consider the set G B'^] and find the index of the largest 


element in this set, which we denote by: i G argmaxjg^c 

2. Make B larger by inclnding this new feature i: we thus have B 


BU{i}. 


3. Check if the resulting set B{B) is nonempty, if not, we repeat the above steps until B{B) 
becomes nonempty. 

4. We let X be the resulting set B obtained npon termination: I = B. 


Problem (34), which is an optimization problem with fewer variables than Problem (15) is found 


to deliver solutions that are better upper bounds to Problem ([^. This also leads to better and 
numerically more robust solutions than those available directly from Algorithm 1. The general al¬ 


gorithmic framework via sequential linear optimization, presented in Section 4.1.2 readily applies 


to obtain good npper bounds to Problem (34). 


B.6 Tighter bounds on /3j’s 


The bounds described via (22) can be sharpened by making nse of good upper bounds to the 
solution of Problem ([^. Towards this end, we need to reformulate ([^. Note that in Problem ([^, 
if we take Aijj to be to be sufficiently large then this will lead to a solution for Problem ([^. We 
rewrite Problem (IM as follows: 


mm 

/3,z,ct 

a 



p 


S.t. 

Zi< a 



i=l 



-6 <dj - <5, j = l,. 

■ • ,P 


-Muzj < Pj < Muzj, j = 1,. 

■ ■ ,P 



■ ,P, 


where the optimization variables are /3, z G and a G M. 


(35) 
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For a fixed a, consider the feasible set of Problem (35): 


Sr, — < 


Observe that 


where 


Tzj<a, ||XT(y-X/3)||oo <<5 

(/3,z): 

\Pj\ < M[/zj, Zj G {0,1}, j = 


Sa C Sa, 


Sr,= < 


y^Zj<a, ||xT(y-X/3)||oo <<5 


( 36 ) 


\/3j\ < Muzj, Zj G [0,1], j = 1,... ,p^ 

is obtained by relaxing the binary variables Zj G {0,1} into the continuous variables Zj G [0,1], for 


all } = 1,... ,p. Noting that Sa C Sa' for a < a'; and using this along with (36) we have 


Sa* ^ Sa() ^ SaQ^ 


where a* is the optimum value, and ao is an upper bound to Problem (35), and hence ao > a* 
The above inequality leads to the following chain of inequalities: 


min /; 
{P,z)&Sa* 

k> 

min /: 

{P,z)&Sag 

k > 

min /3 

(/3,z)e*Sc«o 

'i := Pi (ao) 

max /; 
(^,z)e5^* 

h < 

max /: 
(/3,z)e*SQQ 

h < 

max jSi 
(/3,z)e5ag 

:= ptiao)- 


The quantities at the right end above, i.e. p- (ao) and pf{ao), can be computed by solving a 
pair of linear optimization problems: 

^+(ao) := max /3j 

S.t. ||XT(y-X/3)||oo <<5, 

ll/3||oo < Aft/, 

||/3||i < Muao, 

(37) 

pr(ao):= nun /3j 

s.t. ||XT(y-X/3)||oo <<5, 

ll/3||oo < Aft/, 

||/3||i < Aft/ao- 
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The quantities /i“(ao) and fif (ao) are lower and upper bounds, respectively, for /3j — the bounds 
depend upon ao and A4u- Note that Hi{ao) '■= max (ao); “/^i~(“o)} provides an upper bound 
to |/3j|, which consequently leads to an improved estimate for ||/3||oo — this suggests a way to 
adaptively rehne A4u, and, thus, /xr(ao) and ^f{ao). 


C Additional Experiments 


This section complements the experimental results shown in the main body of the paper. Table 
is an elaborate version of the representative results displayed in Figure Here, we consider 
different values of SNR and also display the prediction errors. The results show that Discrete 
Dantzig Selector outperforms the ^i-Dantzig Selector based methods in terms of estimating the 
true underlying regression coefficients, and does so with better variable selection properties. 


Example-A (n = 200, j) = 500) 


Example-B (n = 200,}) = 500) 


Metric 

SNR 

LO-DS 

LO-DS-Pol 

Ll-DS 

Warm 

v-n 

3 

150.386 (8.993) 

64.17 (4.888) 137.073 (4.511) 

102.583 (5.711) 

Variable Selection Error 

3 

11 (0.247) 

16 (0.573) 

38 (1.266) 

14 (0.658) 

Prediction Error 

3 

0.155 (0.007) 

0.109 (0.003) 

0.132 (0.005) 

0.123 (0.005) 

Number of Nonzeros 

3 

16 (0.362) 

23 (0.362) 

46 (1.447) 

22 (0.693) 

whni 

10 

45.599 (2.508) 

13.773 (1.39) 

47.113 (1.016) 

38.256 (2.068) 

Variable Selection Error 

10 

9 (0.411) 

11 (0.482) 

36 (1.387) 

14 (0.724) 

Prediction Error 

10 

0.045 (0.001) 

0.052 (0.001) 

0.043 (0.001) 

0.035 (0.001) 

Number of Nonzeros 

10 

17 (0.151) 

23 (0.392) 

50 (1.266) 

26 (0.663) 


Metric 

SNR 

LO-DS 

LO-DS-PoI 

Ll-DS 

Warm 

ihn 

3 

3.858 (0.466) 1.123 (0.254) 

5.266 (0.162) 

3.429 (0.19) 

Variable Selection Error 

3 

6 (0.543) 

16 (0.844) 

102 (1.327) 

43 (1.327) 

Prediction Error 

3 

0.158 (0.013) 0.581 (0.038) 

0.18 (0.004) 

0.133 (0.004) 

Number of Nonzeros 

3 

24 (0.274) 

35 (0.663) 

122 (1.327) 

63 (1.266) 

IIM31I2 

10 

0.635 (0.053) 

0.172 (0) 

1.58 (0.049) 

0.664 (0.035) 

Variable Selection Error 

10 

1 (0.137) 

0(0) 

102 (1.327) 

29 (1.371) 

Prediction Error 

10 

0.031 (0.002) 

0.249 (0) 

0.054 (0.001) 

0.028 (0.001) 

Number of Nonzeros 

10 

21 (0.137) 

20 (0) 

122 (1.327) 

49 (1.371) 


Example-C (n 

= 100,}) = 500) 



Metric 

SNR 

LO-DS 

LO-DS-Pol 

Ll-DS 

Warm 

ihni 

3 

7.322 (0.65) 

6.823 (0.693) 6.452 (0.263) 7.674 (0.347) 

Variable Selection Error 

3 

10 (0.573) 

9 (0.814) 

43 (1.538) 

21 (1.116) 

Prediction Error 

3 

0.168 (0.015) 0.966 (0.051) 0.172 (0.004) 

0.193 (0.008) 

Number of Nonzeros 

3 

13 (0.271) 

13 (0.392) 

47 (1.658) 

22 (1.096) 

whn 

10 

2.395 (0.379) 

1.988 (0.333) 3.245 (0.097) 

2.529 (0.357) 

Variable Selection Error 

10 

10 (0.795) 

10 (0.795) 

38 (1.357) 

27 (1.343) 

Prediction Error 

10 

0.054 (0.004) 

0.515 (0.036) 

0.056 (0.002) 

0.051 (0.004) 

Number of Nonzeros 

10 

18 (0.685) 

18 (0.685) 

48 (1.357) 

35 (1.266) 


Metric 

SNR 

LO-DS 

LO-DS-Pol 

Ll-DS 

Warm 


3 

0.019 (0.006) 

0.015 (0) 

0.191 (0.027) 0.018 (0.007) 

Variable Selection Error 

3 

0(0) 

0(0) 

36 (0.693) 

12 (0.85) 

Prediction Error 

3 

0.033 (0.005) 

0.955 (0) 

0.101 (0.007) 

0.028 (0.003) 

Number of Nonzeros 

3 

2(0) 

2(0) 

38 (0.693) 

14(0.85) 

whni 

10 

0.005 (0.002) 

0.004 (0) 

0.057 (0.008) 

0.006 (0.002) 

Variable Selection Error 

10 

0(0) 

1 (0.082) 

36 (0.693) 

13 (0.877) 

Prediction Error 

10 

0.006 (0.002) 

0.527 (0.006) 

0.03 (0.002) 

0.007 (0.001) 

Number of Nonzeros 

10 

2(0) 

3 (0.082) 

38 (0.693) 

15 (0.877) 


Table 4: Tables showing the statistical performance of four different methods: “LO-DS”, “LO-DS-Pol” 


“Ll-DS” and “Warm”, described in Section 
computed in the same fashion as in Figure 


7 The standard errors are given in parentheses (they are 
5). The Discrete Dantzig Selector based methods deliver 
models with good accuracy in estimating the regression coefhcients, and the estimated models are sparser 
than those for the .^i-based method and the method based on Algorithm 2, which is a heuristic strategy 
to approximate good upper bounds for the Discrete Dantzig Selector problem. 


An important advantage of the Discrete Dantzig Selector based methods is that they deliver 
models that are very sparse. The polished version of the Discrete Dantzig Selector is found 
to exhibit better statistical performance than the original Discrete Dantzig Selector estimator. 
Table compares the polished versions of the Discrete Dantzig Selector and the £i-Dantzig 
Selector, and finds that the performance of the former approach is significantly better. 
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Example- 

■A (n = 

200,p = 500) 


Metric 

SNR 

LO-DS-Pol 

Ll-DS-Pol 

11/3-nii 

3 

1.123 (0.254) 

2.163 (0.139) 

Variable Selection Error 

3 

16 (0.844) 

30 (1.116) 

Prediction Error 

3 

0.581 (0.038) 

0.777 (0.02) 

Number of Nonzeros 

3 

35 (0.663) 

50 (1.146) 

11/3-/3*111 

10 

0.172 (0) 

0.282 (0.011) 

Variable Selection Error 

10 

0(0) 

11 (0.392) 

Prediction Error 

10 

0.249 (0) 

0.292 (0.003) 

Number of Nonzeros 

10 

20 (0) 

31 (0.392) 


Table 5: Tables comparing the polished version of Discrete Dantzig Selector with the polished version 
of £i-Dantzig Selector. The standard errors are given in parentheses. The statistical performance of 
£i-Dantzig Selector is inferior, most likely due to its weaker variable selection properties. 
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